Abstract

This paper develops a model of quantile treatment effects with treatment endogeneity.
The model primarily exploits a similarity assumption as the main restriction that
handles endogeneity. From this model we derive a Wald IV estimating equation, and
show that the model does not require functional form assumptions for identification.

We then characterize the quantile treatment function as solving an "inverse" quantile
regression problem and suggest its finite-sample analog as a practical estimator.
This estimator, unlike generalized method-of-moments, can be easily computed by solving
a series of conventional quantile regressions, and does not require grid searches over
high-dimensional parameter sets. A properly weighted version of this estimator is also
efficient. The model and estimator apply to either continuous or discrete variables. We
apply this estimator to characterize the median and other quantile treatment effects in
a market demand model and a job training program.
Regression, Treatment Effects, Empirical Likelihood, Training, Demand Models.
1     Introduction

The ability of quantile regression models (Koenker and Bassett, 1978) to characterize
the impact of variables on the distribution of outcomes makes them appealing for examining
many economic applications; see e.g. Buchinsky (1998) and Koenker and Hallock
(2001). The distributional impacts of social programs, such as welfare, unemployment
insurance, and training programs, are of great interest to economists. Unfortunately,
in all of these cases, treatment is self-selected or endogenous, making conventional
quantile regression inappropriate. This paper makes two contributions.

First,  this  paper  proposes  a  model  of  quantile  treatment  effects  with  endogeneity. 
At  the  heart  of  the  model  is  an  assumption  of  similarity  (containing  rank  invariance  as  a 
special  case)  that  allows  us  to  address  endogeneity.  This  differs  from  the  monotonicity 
assumptions  of  Heckman's  nonparametric  selection  model  and  Imbens  and  Angrist's 
LATE  model.2  We  show  that  this  model's  main  implication  is  a  Wald  IV  estimating 
equation: 

$$P(Y \le q(D,\tau) \mid Z) = \tau, \qquad (1)$$

where $q(d,\tau)$ is the $\tau$-quantile of the potential or counterfactual outcome when the
treatment is exogenously set to the value $d$, $D$ is the actual endogenous treatment,
$Y$ is the actual outcome, and $Z$ is an instrument. Thus, the model provides a causal
justification and interpretation of the Wald IV estimating equation (1).3 We also show
that the model does not require functional form assumptions for identification.

Second, we characterize the function $q$ as solving an inverse quantile regression
problem and suggest its finite-sample analog as a practical estimator. This estimator,
unlike generalized method-of-moments and other similar estimators, can be easily
computed by solving a series of conventional quantile regressions (convex optimization
problems), and does not require grid searches over high-dimensional parameter sets. A
properly weighted version of this estimator is also efficient. We apply this estimator
to characterize the median and other quantile treatment effects in a market demand
model and a job training program.

An important aspect of the proposed model is treatment effect heterogeneity, given
which conventional linear IV inconsistently estimates the average treatment effect
(Example 5.1). Thus, even if one is interested in the ATE, one has to estimate the QTEs first
and integrate them over the quantile index to obtain a consistent estimate of the ATE.
Alternatively, one may estimate only the median treatment effects to characterize the
central effects, using the proposed approach.4


2 See Vytlacil (2001) on the distributional equivalence of these two models.

3 There is very important prior work that estimates functions $q$ under restrictions like (1). This
work starts as early as Hogg (1975); Koenker (1998) characterizes Hogg's estimator as a Wald IV
approach to quantile regression. See Abadie (1995), Christoffersen, Hahn, and Inoue (1999), and MaCurdy
and Timmins (1998) for GMM-like approaches to estimation and testing. Also see Hong and Tamer
(2001) for a fundamental treatment of the censoring case. The problem is that a function $q$ satisfying
the estimating equation has not had any known causal meaning within standard IV models with
non-constant treatment effects, such as those of Heckman or Imbens and Angrist. Thus, our model
provides a causal interpretation and support of (1) and of these previous important estimators.

4 In the expected utility framework, some form of average is typically of interest, but since we
(econometricians) typically do not know (are agnostic about) which particular average, the entire
distributional impact needs to be evaluated. In addition, Manski's (1988) ingenious work provides
ordinal utility models of decision making under uncertainty, where agents maximize a $\tau$-quantile
of the utility distribution. In such a framework only quantiles of potential outcomes would be of
interest to a policy-maker.


Further details of the model and the estimator are as follows. The model is developed
in the standard potential outcomes framework, and the QTE is defined as the
difference in the quantiles of potential outcomes under potential treatments. At the
heart of the model is similarity, a generalization of the rank invariance assumption, which
is reasonable in many applications and also facilitates interesting interpretations of
QTE, as in Lehmann (1974), Doksum (1974), and Koenker and Geling (2001). This
assumption is different from the monotonicity assumptions of the prevalent IV models
(the selection-LATE models). Similarity also requires less stringent independence
conditions (allowing, for example, measurement error in the instrument). As a result
the model differs from the selection-LATE models. However, the two models do contain
a large common subclass.

It should also be noted that the model and estimators considered in this paper
both substantively complement and differ from the fundamental model of Amemiya
(1982) and the QTE model of Abadie, Angrist, and Imbens (2001) developed within the
LATE framework. Amemiya's approach and its extension by Chen and Portnoy (1996),
known as two-stage quantile regression (2SQR), allow continuous treatment variables.
However, we show that 2SQR is not consistent when the quantile treatment effect
differs across quantiles (Appendix B). The inconsistency is noted because this estimator
has often been used expressly to estimate heterogeneous quantile treatment effects. On
the other hand, Abadie et al.'s (2001) approach applies only to binary treatments, and
its extension to more general treatments is not known. Allowing general treatments is
clearly important.

The approach in this paper expressly allows for QTE that vary across quantiles
and applies to arbitrary (continuous, discrete, or binary) treatment variables. Thus
it can be used to study education effects, demand systems, and any other non-binary
treatments.

On the estimation side, the inverse quantile regression is an easily computable and
transparent estimator, unlike GMM, which requires grid searches over high-dimensional
parameter sets. The estimator is obtained as a link between Koenker-Bassett quantile
regression and the Wald IV restrictions. In addition to deriving theoretical properties of
inverse quantile regression, we provide user-friendly computer programs that implement
the estimator, compute standard errors, and produce graphical output.

The remainder of the paper is organized as follows. Section 2 presents the model.
Section 3 provides two economic models as examples: aggregate demand analysis and
the returns to education. Section 4 presents identification results and the inverse
quantile regression. Estimation methods are described in Section 5, and Section 6
contains two empirical applications, corresponding to the models in Section 3.

A word on notation. Following Koenker, we use $F_Y(\cdot \mid x)$ and $Q_Y(\tau \mid x)$ to denote the
conditional distribution function and the $\tau$-quantile of $Y$ given $X = x$; capitals such
as $Y$ denote random variables and lowercase letters such as $y$ denote the values they take.


2     A  Model  of  Quantile  Treatment  Effects 

The  section  begins  with  an  important  preliminary  discussion  that  naturally  leads  to 
the  QTE  model  of  this  paper. 

2.1  Potential  Outcomes  and  the  QTE 

We develop our model within the conventional Neyman-Fisher-Rubin potential outcome
framework.5 Potential real-valued outcomes are indexed against treatment $D$
($D \in \mathcal{D}$, a subset of $\mathbb{R}^l$), and denoted $Y_d$, while potential treatment status is indexed
against the instrument $Z$, and denoted $D_z$. For example, $Y_d$ is an individual's outcome
when $D = d$ and $D_z$ is an individual's treatment status when $Z = z$.

The potential or counterfactual outcomes $\{Y_d, d \in \mathcal{D}\}$, such as wages or demand,
vary across individuals or states of the world. Given the actual treatment $D$, the
observed outcome is

$$Y = Y_D.$$

That is, only the $D$-th component of $\{Y_d, d \in \mathcal{D}\}$ is observed. Typically $D$ is selected
in relation to potential outcomes, inducing endogeneity or sample selectivity.

The objective of causal analysis is to learn about the features of marginal distributions
of potential outcomes $Y_d$. For example, $\mu_{d,d'} = EY_d - EY_{d'}$ is the average
treatment effect (ATE). The quantile treatment effect (QTE) is the difference in quantiles
of potential outcomes under different potential treatments:6

$$Q_{Y_d}(\tau) - Q_{Y_{d'}}(\tau).$$

A main obstacle to learning about the QTE is sample selectivity or endogeneity.
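As a numerical illustration of these definitions, the following minimal sketch computes the QTE for a pair of assumed potential-outcome quantile functions and checks that integrating the QTE over the quantile index recovers the ATE. The functions `q0`, `q1`, and `integrate_midpoint` are hypothetical choices made only for this illustration; they are not taken from the paper.

```python
# Illustrative sketch: QTE and ATE for ASSUMED potential-outcome quantile functions.
# q0 and q1 are hypothetical choices, not part of the paper's model.

def q0(tau):
    """Assumed tau-quantile of the untreated outcome Y_0 (uniform on [0, 1])."""
    return tau

def q1(tau):
    """Assumed tau-quantile of the treated outcome Y_1."""
    return tau + tau ** 2

def qte(tau):
    """Quantile treatment effect: difference of the two quantile functions."""
    return q1(tau) - q0(tau)

def integrate_midpoint(f, n=10_000):
    """Midpoint-rule approximation of the integral of f over (0, 1)."""
    h = 1.0 / n
    return sum(f((i + 0.5) * h) for i in range(n)) * h

# ATE obtained by integrating the QTE over the quantile index.
ate_from_qte = integrate_midpoint(qte)
# ATE obtained directly as a difference in means, using E[Y_d] = integral of q_d.
ate_direct = integrate_midpoint(q1) - integrate_midpoint(q0)

if __name__ == "__main__":
    for tau in (0.1, 0.5, 0.9):
        print(f"QTE({tau}) = {qte(tau):.3f}")
    print(f"ATE via integrated QTE = {ate_from_qte:.4f}, direct = {ate_direct:.4f}")
```

In this assumed design the median effect is 0.25 while the ATE is 1/3, so the two summaries differ whenever the QTE varies with the quantile index.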

Early formulations of QTE by Lehmann (1974) and Doksum (1974) axiomatically
interpret QTE as a measure of interaction of the latent ability $\tau$ ("prone to die at
an early age," "prone to learn fast," etc.) and the treatment. The subjects differ in
this latent characteristic and their response to the treatment is described by QTE. An
assumption that allows such an interpretation is rank invariance. Rank invariance was
also used by Heckman and Smith (1997) and Koenker and Bilias (2001) in quantile
models without endogeneity.7

Our model uses similarity as the main restriction that allows us to address endogeneity.
Similarity facilitates an analogous interpretation of the quantile treatment effects in our
framework and incorporates rank invariance as a special case.

2.2  The  Instrumental  Quantile  Treatment  Model. 

The  first  part  of  the  model  is  a  potential  outcomes  model.  The  other  part  relates  the 
treatment  choice  to  the  potential  outcomes,  accounting  for  endogeneity. 

5 See e.g. Heckman and Robb (1986) and Imbens and Angrist (1994).

6 Generally, QTE are more informative than ATE, since they summarize the distributional impact,
whereas ATE summarize the impact on the first moment of the distribution. In fact,
$\mu_{d,d'} = \int_0^1 (Q_{Y_d}(\tau) - Q_{Y_{d'}}(\tau)) \, d\tau$.

7 Heckman and Smith (1997) use rank invariance to identify $Q_{Y_d - Y_{d'}}(\tau) = Q_{Y_d}(\tau) - Q_{Y_{d'}}(\tau)$.


Assumption 1 (IQT Model) For almost every value of $(X, Z) = (x, z)$,

A1 Potential Outcomes. Given $X = x$, for some $U_d \sim U(0,1)$,

$$Y_d = q(d, X, U_d),$$

such that $q(d,x,\tau)$ is the $\tau$-th quantile of $Y_d$ for any $0 < \tau < 1$.

A2 Selection. For unknown function $\delta$ and random process $V$, given $X = x$, $Z = z$,

$$D_z = \delta(z, x, V).$$

A3 Independence. Given $X$, $\{U_d\}$ is independent of $Z$.

A4 Similarity. For each $d$ and $d'$, given $(V, X, Z)$,

$U_d$ is equal in distribution to $U_{d'}$.

A5 Observed variables $W$ consist of (for $U = U_D$)

$$\begin{cases} Y = q(D, X, U), \\ D = \delta(Z, X, V), \\ X, Z. \end{cases}$$

Remark 2.1 Of interest also is a much more restrictive special case of A3 and A4:

A3* Full Independence. $\{U_d, V\}$, or equivalently $\{Y_d, D_z\}$, are jointly independent
of $Z$, given $X$.

A4* Rank Invariance. $U_d = U_{d'} = U$ for each $d$.

In A1 the conditional $\tau$-quantile of $Y_d$ is $q(d, x, \tau)$, given $X = x$. Our main interest
is the Conditional Quantile Treatment Effect

$$q(d, x, \tau) - q(d', x, \tau),$$

the difference in quantiles of potential outcome distributions conditional on $x$.

In A2, the unobserved random vector $V$ is responsible for the difference in treatment
choices $D_z$ across observationally identical individuals. $\delta(\cdot)$ is the (measurable)
selection function. We do not impose any other assumptions on this function. This is
important to accommodate realistic economic examples.

A3 states that potential outcomes are independent of $Z$, given $X$. A3 is more
general than A3*, the assumption of selection-LATE models, which requires both $\{Y_d\}$
and the potential treatments $\{D_z\}$ to be independent of the instrument $Z$. A3* is a
strong assumption that can be easily violated when the instrument is measured with
error or there are omitted variables related to $Z$ (a part of the error $V$) in the selection
equation. Imbens and Angrist (1994) provide additional examples violating A3*.


Given the same observed characteristic $x$ and treatment $d$, the subjects still differ
in terms of potential outcomes. Their relative ranking is determined by the rank or
ability vector $\{U_d\}$. This vector can be collapsed to a single variable under assumption
A4*, the rank invariance or common error assumption.

A4, similarity, states that given the information $(V, Z, X)$ the expectation of (any
function of) $U_d$ does not vary across the treatment states $d$. In other words, ex-ante
the ranks are "similar," while ex-post the ranks may differ. Thus similarity allows
substantial ex-post slippage in the ranks, whose importance was shown
by Heckman and Smith (1997). See Example 3.2 for an illustration.

A4 also facilitates interpretation of the QTE as a measure of the interaction between
the latent ex-ante ability $\tau$ and the treatment, following Doksum (1974) and Koenker
and Bilias (2001). Additionally, A4 is a key identification device, leading to the Wald
restrictions in Section 4.

Similarity is the main restriction of the IQT model. It is absent in the conventional
LATE/selection models. However, A4 enables a more general selection function
in A2 that requires neither the monotonicity assumption nor the stronger independence
assumptions of the LATE models. Thus, LATE models mainly exploit monotonicity and
a stronger independence assumption to address endogeneity, while the present approach
uses the similarity assumption. The value of one versus the other has to be judged in
each particular application.

2.3     A  Comparison  with  a  LATE  Model  with  Common  Error. 

Although the IQT model differs from the selection-LATE model, the two do contain a large
common subclass. Indeed, consider the following model:

V1 $Y_d = g(\mu_d(X), U)$, $d \in \{0,1\}$.

V2 $D_z = 1\{d(z,X) > V\}$ for some real-valued function $d(\cdot)$.
V3 $\{Y_d, V\}$ (or $\{Y_d, D_z\}$) are independent of $Z$, given $X$.
V4 $U$ does not vary across potential treatments $d$.

Assume that $g$ is monotone in $U$, so that the error $U$ can be normalized to be uniform.
This model is a special case of the IQT model, with assumption V4 corresponding to the
exact rank invariance or common error assumption A4* (see Doksum (1974), Robins
and Tsiatis (1991), Heckman and Smith (1997), and Vytlacil (2000) for various
justifications of rank invariance), and V3 being a stronger version of the independence
assumption A3, in fact corresponding to A3*. Vytlacil (2000) shows that such a model
incorporates a wide variety of familiar nonlinear simultaneous equations models. In
turn, the IQT model incorporates the model V1-V4 as an important special case.


3     Economic  Examples 

The  following  examples  highlight  the  nature  of  the  IQT  model.  The  discussion  is  quite 
thorough  because  it  underlies  the  empirical  applications  in  Section  6. 

Example 3.1 (Demand with Non-Separable Error) The following is a generalization
of the classic supply-demand example. Consider the "random coefficient" model

$$\begin{cases}
\text{i.} & Y_p = q(p, U), \\
\text{ii.} & \tilde{Y}_p = \rho(p, z, \mathcal{U}), \\
\text{iii.} & P \in \{p : \rho(p, Z, \mathcal{U}) = q(p, U)\}.
\end{cases} \qquad (2)$$

The map $p \mapsto Y_p$ is the random demand function, that is, it is the potential demand
when the price is set (externally) to the value $p$. Likewise, $p \mapsto \tilde{Y}_p$ is the random
supply function, that is, the potential supply when the price is set (externally) to $p$.
Additionally, $Y_p$ and $\tilde{Y}_p$, $q(\cdot)$, and $\rho(\cdot)$ depend on the covariates $X$, but this dependence
is suppressed. The random variable $U$ is the level of the demand in the sense that $q(p, U) <
q(p, U')$ when $U < U'$. Demand is maximal when $U = 1$ and minimal when $U = 0$,
holding $p$ fixed. Likewise, $\mathcal{U}$ is the level of supply. The $\tau$-quantile of the demand curve
$p \mapsto Y_p$ is given by

$$p \mapsto Q_{Y_p}(\tau) = q(p, \tau).$$

Thus with probability $\tau$, the curve $p \mapsto Y_p$ lies below the curve $p \mapsto Q_{Y_p}(\tau)$.

The quantile treatment effect is characterized by an elasticity $\partial \ln q(p,\tau) / \partial \ln p$. The
elasticity depends on the state of the demand $\tau$ (low or high) and may vary with $\tau$. For
example, this variation could arise when the number of buyers varies and aggregation
induces non-constant elasticity across the demand levels as a process of summation of
individual demand curves, holding the price fixed.

This model incorporates many traditional models with separable error

$$Y_p = q(p) + \varepsilon, \quad \text{where } \varepsilon = F^{-1}(U). \qquad (3)$$

The model i. is much more general in that the price can affect the entire distribution
of the demand curve, while in (3) it only affects the location of the distribution of the
demand curve.
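The contrast between the separable model (3) and the non-separable model i. can be sketched numerically. The two demand functions below are hypothetical functional forms chosen only for illustration (they are not the paper's specifications); the helper names `price_effect` and `elasticity` are likewise assumptions of this sketch.

```python
import math

def q_separable(p, tau):
    """ASSUMED location-shift demand q(p) + F^{-1}(tau), with a logistic F^{-1}."""
    return 10.0 - 2.0 * p + math.log(tau / (1.0 - tau))

def q_nonseparable(p, tau):
    """ASSUMED non-separable demand: ln q = (1 + tau) - (2 - tau) * ln p, for p >= 1."""
    return math.exp(1.0 + tau) * p ** (-(2.0 - tau))

def price_effect(q, p, tau, dp=1.0):
    """Change in the tau-quantile of demand when the price moves from p to p + dp."""
    return q(p + dp, tau) - q(p, tau)

def elasticity(q, p, tau, h=1e-6):
    """Finite-difference approximation of d ln q(p, tau) / d ln p."""
    return (math.log(q(p * math.exp(h), tau)) - math.log(q(p, tau))) / h

if __name__ == "__main__":
    for tau in (0.1, 0.5, 0.9):
        print(
            f"tau={tau}: separable effect={price_effect(q_separable, 2.0, tau):.3f}, "
            f"non-separable effect={price_effect(q_nonseparable, 2.0, tau):.3f}, "
            f"non-separable elasticity={elasticity(q_nonseparable, 2.0, tau):.3f}"
        )
```

Under the separable form the price effect is the same at every quantile (a pure location shift), whereas under the assumed non-separable form both the price effect and the elasticity change with $\tau$.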

Condition iii. is the equilibrium condition that generates endogeneity: the selection
of the actual price by the market depends on the potential demand and supply outcomes
i. and ii. As a result $P = \delta(Z, V)$, where $V$ consists of $U$, $\mathcal{U}$, and other variables
(including "sunspot" variables, if the equilibrium price is not unique). Thus what we
observe can be written as simultaneous equations of a general form, with observables8

$$Y = q(P, U), \qquad P = \delta(Z, V). \qquad (4)$$


8 To appreciate the generality, note that the model incorporates, for example, the simultaneous
equations model of Imbens and Newey (2001), who assume that $V$ is univariate, $\delta$ is monotone in $V$, and both
$V$ and $U$ are independent of $Z$, if in addition we assume $U$ is uniform. Imbens and Newey (2001)
developed some ingenious identification results using these stronger assumptions.
Because of endogeneity, $Q_{Y|P}(\tau) \neq q(P, \tau)$; therefore conventional quantile regression
will be inappropriate for estimating the $\tau$-th quantile demand curve. Additionally,
we show in Appendix A that 2SQR is generally not suitable for estimation purposes.
We show that instrumental variables $Z$, like weather conditions, that shift the
supply curve and do not affect the level of the demand curve $U$ allow identification
of the $\tau$-quantile of the demand function, $p \mapsto q(p, \tau)$. Furthermore, the IQT model
allows arbitrary correlation between $Z$ and $V$. This allows, for example, measurement
error in $Z$ (e.g. in weather conditions). The standard IV approaches (Heckman et al.
(2001), Imbens and Angrist (1994)) do not accommodate such a possibility.

Example 3.2 (Education/Training Returns) Let "earnings" in the "education"
states $d \in \{0, 1\}$ be determined by a "random coefficients" model

$$Y_1 = q_1(X, U_1), \qquad Y_0 = q_0(X, U_0).$$

An individual's training or education decision is given by

$$D = 1\{\varphi(Z, X, V) > 0\},$$

where the unobserved vector $V$ potentially depends on (but is not necessarily determined
by) the ability vector $(U_1, U_0)$ and arbitrarily on the functions $q_1$ and $q_0$, $X$, and $Z$. The
first kind of dependence is endogeneity.

In the standard Roy model, no restrictions are placed on the individual-specific
variations in earnings, and the individual observes these before making the schooling
choice. For identification, we impose similarity: conditional on $(Z, X, V)$, $U_1$ equals
$U_0$ in distribution. This is more restrictive than the general case, but perhaps not as
restrictive as it may appear. This restriction allows arbitrary correlation between $Y_0$
and $Y_1$ and allows general treatment impacts through the $q_1(\cdot)$ and $q_0(\cdot)$ functions.
A main difference between this model and the Roy model is the implicit ex ante nature
of the decision process. Instead of knowing the exact outcomes in any state of the
world, the subject anticipates the same distribution of ability across treatment states
and makes the decision accordingly.9

Indeed, consider a simple example that satisfies similarity A4:

$$U_0 = \eta + \nu_0, \qquad U_1 = \eta + \nu_1,$$

where $\eta$ is a function of the error vector $V$ in the selection equation, and $\nu_0$ and $\nu_1$
are the slippage terms such that $\nu_0 \sim \nu_1$ given $(X, Z, V)$. Rank invariance A4* is a
degenerate case when $\nu_1 = \nu_0 = 0$.

Finally, note that similarity need only hold conditional on $Z$, $X$, and $V$. This
seems to be a reasonable framework. For example, people generally decide whether
to attend college before they observe their rank/ability among college-educated
and non-college-educated individuals with observationally identical characteristics.
Thus, it seems a plausible approximation that they would anticipate the same
distribution of their rank/ability across the treatment states relative to similar
individuals (with the same covariates $X$ and $Z$).


9 More precisely, we assume that the subject has enough information only to anticipate the same distribution
of ability across states. The assumption does not require the subject to have correct beliefs.

Another difference from the conventional IV model is that the IQT model expressly
allows for dependence between the instrument $Z$ and $V$, whereas the standard
approaches expressly disallow this, as mentioned in the previous example. For example, consider
the following simple schooling decision rule:

$$D = 1\{\varphi(Z) + V > 0\}.$$

In the schooling or training context, if $Z$ is family background, it may be measured
with sizable error, so independence between $Z$ and $V$ need not hold. $V$ could also
capture omitted variables that are correlated with $Z$ and affect the schooling decision
but not the outcome. Note that measurement error or omitted variables also violate
the monotonicity assumption often used in the IV literature. See Imbens and Angrist
(1994) for other examples of violation.

To summarize, three aspects of the proposed model are highlighted by the above
examples. First, the IQT model allows arbitrarily general quantile treatment effects.
The similarity assumption in no way restricts their shape. Second, under similarity, we
can interpret the QTE as measuring the interaction between the latent ex-ante ability
and the treatment, following Doksum (1974) and Koenker and Bilias (2001).
Similarity seems reasonable in many settings. Third, similarity allows the selection
in A2-A3 to be more general than that in the popular IV approaches, although this
should be taken as a subsidiary point.

4     Wald  IV  and  Inverse  Quantile  Regression 

Here we establish a link between the IQT model and the Wald-type IV restrictions,
relate those to Koenker and Bassett's (1978) quantile regression, and show that the
model is identified without functional form assumptions.

4.1     Main  Identification  Restriction 

The following theorem provides an important link between the parameters of the
IQT model and the Wald-type IV estimating equations.

Theorem 1 Suppose A1-A5 hold, and given $X, Z$

i. if $Y$ is continuously distributed ($q(D, X, \tau)$ is strictly increasing in $\tau$ a.s.) then a.s.

$$P[Y \le q(D, X, \tau) \mid X, Z] = \tau,$$
$$P[Y < q(D, X, \tau) \mid X, Z] = \tau,$$

ii. otherwise ($q(D, X, \tau)$ is non-decreasing in $\tau$, a.s.), a.s.

$$P[Y \le q(D, X, \tau) \mid X, Z] \ge \tau,$$
$$P[Y < q(D, X, \tau) \mid X, Z] \le \tau,$$

with the last inequality being strict if $q(D, X, \tau') = q(D, X, \tau)$ for some $\tau' > \tau$ with
probability $P > 0$ given $X$ and $Z$.

By linking the IQT model to Wald IV quantile restrictions, Theorem 1 provides
empirical and causal content to these restrictions. In this regard, the IQT model
serves the same purpose as the LATE model developed by Imbens and Angrist (1994)
to provide the link between the Wald IV approach and the (local) average treatment
effects. However, our results employ the similarity assumption in place of the
monotonicity assumptions to obtain this link.
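The content of Theorem 1 can be checked in a small simulation. The sketch below uses an assumed design that satisfies A1-A5 under rank invariance (A4*): the structural quantile function `q`, the selection rule inside `simulate`, and all numeric constants are hypothetical choices for illustration only. It compares the share of observations with $Y \le q(D, \tau)$ conditional on the instrument with the same share conditional on the endogenous treatment.

```python
import random

def q(d, tau):
    """ASSUMED structural quantile function q(d, tau) = tau * (1 + d)."""
    return tau * (1.0 + d)

def simulate(n, seed=0):
    """Draw (y, d, z) from a hypothetical design with treatment selected on the rank U."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.random()                      # rank U ~ U(0, 1), common to both states
        z = 1 if rng.random() < 0.5 else 0    # binary instrument, independent of U (A3)
        w = rng.random()                      # extra selection noise, part of V
        d = 1 if u + 0.6 * z + 0.3 * (w - 0.5) > 0.8 else 0  # endogenous selection (A2)
        rows.append((q(d, u), d, z))          # observed outcome Y = q(D, U)
    return rows

def share_below(rows, tau, keep):
    """Share of observations with Y <= q(D, tau) among rows selected by keep(d, z)."""
    kept = [(y, d) for (y, d, z) in rows if keep(d, z)]
    return sum(1 for (y, d) in kept if y <= q(d, tau)) / len(kept)

if __name__ == "__main__":
    rows = simulate(40_000, seed=1)
    tau = 0.5
    for z_val in (0, 1):
        share = share_below(rows, tau, lambda d, z, z_val=z_val: z == z_val)
        print(f"P[Y <= q(D,tau) | Z={z_val}] ~ {share:.3f}")
    share_d1 = share_below(rows, tau, lambda d, z: d == 1)
    print(f"P[Y <= q(D,tau) | D=1] ~ {share_d1:.3f}")
```

In this design the shares conditional on $Z$ are close to $\tau$, as the theorem states, while the share conditional on $D = 1$ is not, which is why conventional quantile regression on $D$ does not recover $q$.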

As noted, Theorem 1 allows for an arbitrary variable $Y$, for arbitrary treatment
variable $D$, and arbitrary instrument $Z$. Thus equations (5) and (6) lead to natural
ways to estimate any model with endogeneity as long as the corresponding quantiles
of the potential outcome distribution $q(d, x, \tau)$ may be specified. We focus on the
continuous $Y$, but the discrete case is clearly relevant; see e.g. Manski (1985), Horowitz
(1992), Powell (1986), and Hong and Tamer (2001).

Before proceeding further, it is very important to note that although the IQT model
allows the use of a "conditioning on Z" strategy to estimate the quantile treatment
effects, it is not possible to use the same "conditioning on Z" strategy to estimate other
treatment effects of interest. For example, in order to estimate the average treatment
effect within the IQT model, we first need to estimate the quantile treatment effects
and then integrate them over the quantile index $\tau$. Conventional linear IV will not work
here. This feature is analogous to that in the selection-LATE models (Heckman 1990).

Example 4.1 (Average Treatment Effects: Failure of 2SLS) Within A1-A5, suppose
$\mu(d)$ is finite in the equation

$$Y_d = \mu(d) + \epsilon_d, \qquad E\epsilon_d = 0,$$

where $\mu(d)$ is the mean treatment function. It would be natural to expect that
$E[Y - \mu(D) \mid Z] = 0$, but this is false since generally

$$E[q(D, U) - \mu(D) \mid Z] \neq 0,$$

because $U$ is not independent of $D$ conditional on $Z$ in general, so that

$$E[Y \mid Z] = E[q(D, U) \mid Z] = \int\!\!\int_{[0,1]} q(d, u) \, dP[D = d, U = u \mid Z]$$

$$\neq \int\!\!\int_{[0,1]} q(d, u) \, dP[D = d \mid Z] \cdot dP[U = u \mid Z] = E[\mu(D) \mid Z].$$

The equality holds if there is no endogeneity or the treatment effect is constant.
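The failure described in this example can be reproduced with a short calculation. The sketch below is built on assumptions made only for illustration: a binary instrument independent of $U$, the hypothetical selection rule `selection`, and two hypothetical structural functions `q_heterogeneous` and `q_constant`. It computes the population Wald IV ratio by numerical integration over $U$ and compares it with the ATE obtained by integrating the QTE.

```python
# Illustrative sketch under ASSUMED functional forms; none of them come from the paper.

def expect_over_u(f, n=100_000):
    """Midpoint-rule approximation of E[f(U)] for U ~ U(0, 1)."""
    h = 1.0 / n
    return sum(f((i + 0.5) * h) for i in range(n)) * h

def selection(u, z):
    """Hypothetical selection rule: everyone is treated if z = 1, only high ranks if z = 0."""
    return 1 if (z == 1 or u > 0.5) else 0

def q_heterogeneous(d, u):
    """Assumed outcome with a treatment effect equal to the rank u."""
    return u * (1.0 + d)

def q_constant(d, u):
    """Assumed outcome with a constant treatment effect of 0.5."""
    return u + 0.5 * d

def wald_iv(q):
    """Population Wald ratio (E[Y|Z=1] - E[Y|Z=0]) / (E[D|Z=1] - E[D|Z=0])."""
    ey = {z: expect_over_u(lambda u: q(selection(u, z), u)) for z in (0, 1)}
    ed = {z: expect_over_u(lambda u: float(selection(u, z))) for z in (0, 1)}
    return (ey[1] - ey[0]) / (ed[1] - ed[0])

def ate(q):
    """ATE obtained by integrating the QTE q(1, tau) - q(0, tau) over tau."""
    return expect_over_u(lambda tau: q(1, tau) - q(0, tau))

if __name__ == "__main__":
    print("heterogeneous effect: Wald =", round(wald_iv(q_heterogeneous), 4),
          " ATE =", round(ate(q_heterogeneous), 4))
    print("constant effect:      Wald =", round(wald_iv(q_constant), 4),
          " ATE =", round(ate(q_constant), 4))
```

With the heterogeneous effect the Wald ratio is 0.25 while the ATE is 0.5; with the constant effect both equal 0.5, in line with the closing remark of the example.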

4.2     The  Inverse  Quantile  Regression 

The main identification restriction of Theorem 1 can be posed as an optimization
problem, which we call the inverse quantile regression for its "inverse" relation to the
(conventional) quantile regression of Koenker and Bassett (1978). This links the IQT
model, the Wald IV restrictions, and quantile regression together.

In order to obtain the link, we note that Theorem 1 states that 0 is the $\tau$-th quantile
of the random variable $Y - q(X, D, \tau)$ conditional on $(X, Z)$. Therefore, the problem of
finding a function $q(x, d, \tau)$ satisfying equations (5) or (6) is the problem of the inverse
quantile regression:

Find a function $q(x, d, \tau)$ such that 0 is the solution to the quantile regression
problem, in which we regress $Y - q(X, D, \tau)$ on any function of $(Z, X)$.

Theorem  2  formally  states  this  result. 

Theorem 2 For P-a.e. value $(x, z)$ of $(X, Z)$, the following are equivalent statements,
for each measurable $q'$:

1. $q'$ satisfies equation (5) or (6) (in place of $q$).

2. $Q_{\epsilon | x, z}(\tau) = 0$, where $\epsilon = Y - q'(D, X, \tau)$.

3. assuming integrability, $q'$ satisfies

$$0 = \operatorname{argminl}_{v \in \mathbb{R}} \; E[\rho_\tau(Y - q'(x, d) - v) \mid x, z],$$

where $\rho_\tau(u) = \tau u^+ + (1 - \tau) u^-$.

4. $q'$ is an $\arg\min |v(x, z)|$, where the minimum is computed over all candidate
(measurable) functions $\varphi$, and, assuming integrability,

$$v(x, z) = \operatorname{argminl}_{v \in \mathbb{R}} \; E[\rho_\tau(Y - \varphi(x, d) - v) \mid x, z].$$

Remark 4.1 Integrability conditions can be removed by subtracting $\rho_\tau(Y - q'(x, d) - \bar{v})$,
where $\bar{v}$ is a fixed number, inside the expectation. The "argminl" above means
"$\lim_{\tau' \uparrow \tau} \arg\min$," and is a pure technicality, ensuring uniqueness of the solution. It is only
needed for non-continuous $Y$ and at most countably many values of $\tau \in (0, 1)$.

Theorem  2  applies  to  continuous,  discrete,  or  mixed  outcomes,  so  estimation  based 
on  Theorem  2  can  be  applied  to  such  data.  Theorem  2  is  both  interpretive  and  con- 
structive. First,  any  consistent  estimator  asymptotically  solves  the  inverse  quantile 
regression  problem.  Second,  Theorem  2  (part  4)  suggests  a  way  to  construct  practical 
estimators  (in  addition  to  obvious  method  of  moments  or  minimum  distance  methods 
based  on  equations  (5)  and  (6)). 
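
As a small numerical illustration of part 3 of Theorem 2, the sketch below checks that 0 minimizes the expected check loss of $Y - q'(D, \tau)$ when $q'$ is the true quantile function, and that the minimizer moves away from 0 when $q'$ is shifted. The design $q(d, u) = d(1 + u) + 2u$, the exogenous binary $D$, the grid, and the function names are all assumptions made for illustration; conditioning on $(x, z)$ is dropped for brevity.

```python
import numpy as np

def check_loss(u, tau):
    """Check function rho_tau(u) = tau * u^+ + (1 - tau) * u^-."""
    u = np.asarray(u, dtype=float)
    return tau * np.maximum(u, 0.0) + (1.0 - tau) * np.maximum(-u, 0.0)

def argmin_check_loss(resid, tau, grid):
    """Grid point v that minimizes the sample mean of rho_tau(resid - v)."""
    losses = np.array([check_loss(resid - v, tau).mean() for v in grid])
    return float(grid[np.argmin(losses)])

# Hypothetical design: Y = q(D, U) with q(d, u) = d * (1 + u) + 2 * u, U ~ U(0, 1).
rng = np.random.default_rng(0)
n, tau = 20_000, 0.3
d = rng.integers(0, 2, size=n)
u = rng.uniform(size=n)
y = d * (1.0 + u) + 2.0 * u

grid = np.linspace(-1.0, 1.0, 401)
q_true = d * (1.0 + tau) + 2.0 * tau                        # q(D, tau)
v_true = argmin_check_loss(y - q_true, tau, grid)           # close to 0
v_wrong = argmin_check_loss(y - (q_true + 0.5), tau, grid)  # close to -0.5
```

With the correct quantile function the minimizer is approximately 0; shifting the candidate by 0.5 moves the minimizer to approximately -0.5, which is the sense in which the inverse problem singles out $q$.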

4.3     Conditions  for  (Global)  Identification 

Here we show that we do not need functional form assumptions to identify QTE as
long as we have a reasonable instrument. We focus on the case of binary $D$, while the
appendix contains generalizations. The following analysis is all conditional on $X = x$,
but we suppress this for ease of notation. Define $\mathcal{L}(x)$ as the convex hull of the set of
functions $\varphi$ mapping $d$ from $\{0, 1\}$ to $\{y : f_Y(y|d) > 0\}$ such that $P(Y < \varphi(D, \tau)|Z)$
belongs to $[\tau - \delta, \tau + \delta]$ a.s. for $\delta > 0$.
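Define the following function

$$\Pi_z(\varphi, x) = \big(P[Y < \varphi(D)|z_1],\; P[Y < \varphi(D)|z_2]\big),$$

where $z = (z_j, j = 1, 2)$. Assuming relevant smoothness, define

$$J_z(\varphi, x) = \frac{\partial}{\partial \varphi} \Pi_z(\varphi)
= \begin{pmatrix}
f_Y(\varphi(0)|D = 0, z_1) P[D = 0|z_1] & f_Y(\varphi(1)|D = 1, z_1) P[D = 1|z_1] \\
f_Y(\varphi(0)|D = 0, z_2) P[D = 0|z_2] & f_Y(\varphi(1)|D = 1, z_2) P[D = 1|z_2]
\end{pmatrix}
= \begin{pmatrix}
f_{Y,D}(\varphi(0), 0|z_1) & f_{Y,D}(\varphi(1), 1|z_1) \\
f_{Y,D}(\varphi(0), 0|z_2) & f_{Y,D}(\varphi(1), 1|z_2)
\end{pmatrix}.$$
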

We will say that rank $J_z(\varphi, x)$ is full w. pr. $> 0$ if with positive probability $Z =
(Z_1, Z_2)$ is such that rank $J_z(\varphi, x) = 2$, where $Z_1$ and $Z_2$ are independent replicas of
$Z$, given $X = x$.

Theorem 3 Suppose A1-A5 hold, and that $f_Y(y|d, z, x) > 0$ and finite over the range
of $d \mapsto q(d, x, \tau)$. Then $d \mapsto q(d, x, \tau)$ is a unique solution of

$$P(Y < q(d, x, \tau)|x, z) = \tau \ \text{ for P-a.e. } z, \text{ given } X = x, \tag{7}$$

among $\mathcal{L}(x)$ if for any $\varphi \in \mathcal{L}(x)$, $J_z(\varphi, x)$ is finite and has full rank w. pr. $> 0$.

These  conditions  are  akin  to  the  identification  of  average  treatment  effects  in  Abadie 
(2001)  or  Das  (2001).  The  difference  is  in  the  weighting  by  a  density.  The  condition 
is easy to verify in many applications. For example, suppose $Z = 0$ or $1$ as in the
JTPA example discussed in Section 6. Then $\det J_z \neq 0$ is equivalent to a nonconstant
likelihood ratio property:

$$\frac{f_{Y,D}(\varphi(0), 0|Z = 1)}{f_{Y,D}(\varphi(1), 1|Z = 1)} \neq \frac{f_{Y,D}(\varphi(0), 0|Z = 0)}{f_{Y,D}(\varphi(1), 1|Z = 0)},$$

for any $\varphi \in \mathcal{L}(x)$. The instrument $Z$ should impact the joint distribution of $Y$ and
$D$ at all relevant points. In the JTPA data $P[D = 1|Z = 0] = 0$, which means
$f_{Y,D}(y, 1|Z = 0) = 0$ for any $y$, so the condition is always true as long as the left-hand
side is finite. In other cases, the condition is simply plausible.
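
The rank condition for a binary instrument can be checked numerically once the joint densities are specified. The following is a minimal sketch under stated assumptions: the normal conditional densities, the candidate function `phi`, and the helper names are hypothetical; the 60 percent take-up among those offered treatment and the zero take-up among those not offered follow the JTPA description in the text.

```python
import numpy as np

def normal_pdf(y, mean, sd=1.0):
    """Density of N(mean, sd^2) at y."""
    return np.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def jacobian(phi, p_d1, cond_density):
    """2 x 2 matrix with entries f_Y(phi(d) | D=d, z_j) * P[D=d | z_j].

    phi          : pair (phi(0), phi(1))
    p_d1         : pair (P[D=1 | z_1], P[D=1 | z_2])
    cond_density : function (y, d, j) -> f_Y(y | D=d, z_j)
    """
    rows = []
    for j, p1 in enumerate(p_d1):
        p_d = (1.0 - p1, p1)
        rows.append([cond_density(phi[d], d, j) * p_d[d] for d in (0, 1)])
    return np.array(rows)

phi = (0.0, 1.0)  # hypothetical candidate d -> phi(d)
dens = lambda y, d, j: normal_pdf(y, mean=float(d))  # assumed conditional densities

# JTPA-like case: z_1 = offer (60 percent take-up), z_2 = no offer (nobody treated).
J_offer = jacobian(phi, p_d1=(0.6, 0.0), cond_density=dens)
rank_offer = np.linalg.matrix_rank(J_offer)

# Irrelevant instrument: both rows coincide, so the rank condition fails.
J_flat = jacobian(phi, p_d1=(0.6, 0.6), cond_density=dens)
rank_flat = np.linalg.matrix_rank(J_flat)
```

In the JTPA-like case the lower-right entry is exactly zero and the determinant is the negative product of the two off-diagonal entries, so the matrix has full rank whenever those densities are positive.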

5     Estimation 

In this paper, it is natural to focus on estimating the basic linear model, which covers
a wide area of applications. In this model a conditional $\tau$-quantile of the potential
outcome is given by (or approximated by)

$$Q_{Y_d|x}(\tau) = d'\alpha_\tau + X'\beta_\tau, \tag{8}$$

where $d$ is an $l \times 1$ vector of treatment variables (possibly interacted with covariates)
and $x$ is a $k \times 1$ vector of (transformations of) covariates. This model is a specialization
of A1, and is a foundation of quantile regression research (see e.g. Koenker and Hallock
(2000) and Buchinsky (1998) for reviews).

Using  Theorem  2  we  offer  the  inverse  quantile  regression  estimator  as  a  finite- 
sample  analog  of  the  inverse  quantile  regression  in  the  population.  In  the  appendix,  for 
completeness  and  comparisons,  we  also  provide  the  results  for  the  generalized  empirical 
likelihood  estimators.  The  presented  estimator  is  perhaps  the  only  practical  estimator 
that  can  be  applied  to  reasonably  general  cases.  Other  strategies  such  as  method  of 
moments or empirical likelihood are typically infeasible,$^{10}$ as explained below.

To state the idea clearly, first suppose we have no covariates or simply treat covariates
as part of the vector $d$ above. In this case, a simple analog of the population
inverse quantile regression is as follows:

Find $\hat\alpha$ by minimizing a norm of $\hat\gamma[\alpha]$ over $\alpha$ subject to $\hat\gamma[\alpha]$ solving the quantile
regression of $Y - D'\alpha$ on $Z$: $\hat\gamma[\alpha] = \operatorname{argmin}_{\gamma} \frac{1}{n} \sum_{i=1}^n \rho_\tau(Y_i - D_i'\alpha - Z_i'\gamma)$.

Now suppose we have covariates $X_i$. Then the procedure can be modified as follows:

Find $\hat\alpha$ by minimizing a norm of $\hat\gamma[\alpha]$ over $\alpha$ subject to $(\hat\gamma[\alpha], \hat\beta[\alpha])$ solving the
quantile regression of $Y - D'\alpha$ on $Z$ and $X$:

$$(\hat\gamma[\alpha], \hat\beta[\alpha]) = \operatorname{argmin}_{\gamma, \beta} \frac{1}{n} \sum_{i=1}^n \rho_\tau(Y_i - D_i'\alpha - X_i'\beta - Z_i'\gamma).$$

The estimate of $\beta$ can be obtained as a usual quantile regression of $Y - D'\hat\alpha$ on $X$.
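
The two-step logic can be sketched for the simplest case of a scalar treatment, an intercept, and a binary instrument. This is an illustrative sketch and not the authors' programs: the data-generating design, the grid, and the function names are assumptions. With a binary $Z$ the quantile regression of $Y - D\alpha$ on $(1, Z)$ is saturated, so its slope is the difference of two group quantiles and no linear-programming solver is needed.

```python
import numpy as np

def gamma_hat(alpha, y, d, z, tau):
    """Slope on a binary Z in the tau-quantile regression of Y - D*alpha on (1, Z).

    With an intercept and a binary regressor the fit is saturated, so the slope is
    the difference between the two within-group sample quantiles (np.quantile
    interpolates between order statistics, a negligible difference in large samples).
    """
    e = y - d * alpha
    return np.quantile(e[z == 1], tau) - np.quantile(e[z == 0], tau)

def iqr_binary(y, d, z, tau, alpha_grid):
    """Inverse step: grid value of alpha that makes |gamma_hat(alpha)| smallest."""
    gammas = np.array([gamma_hat(a, y, d, z, tau) for a in alpha_grid])
    return float(alpha_grid[np.argmin(np.abs(gammas))])

# Hypothetical design with self-selection: only those offered (Z = 1) may take
# treatment, and take-up is more likely for high-rank individuals (large U).
rng = np.random.default_rng(1)
n, tau = 40_000, 0.5
u = rng.uniform(size=n)
z = rng.integers(0, 2, size=n)
d = ((z == 1) & (u + rng.uniform(size=n) > 0.8)).astype(float)
y = d * (1.0 + u) + 2.0 * u  # q(d, u) = d * (1 + u) + 2 * u, so alpha_tau = 1 + tau

alpha_iqr = iqr_binary(y, d, z, tau, np.linspace(0.0, 3.0, 301))
alpha_qr = np.quantile(y[d == 1], tau) - np.quantile(y[d == 0], tau)  # ignores endogeneity
```

In this simulated design the true median effect is 1.5. The inverse step recovers a value near it, while the conventional quantile regression contrast is pushed upward by positive selection into treatment.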

In order to improve efficiency, we allow the observations to be weighted differently
and allow for estimated instruments. Define the weighted quantile regression objective
function:

$$Q_n(\alpha, \beta, \gamma) = \frac{1}{n} \sum_{i=1}^n \rho_\tau\big(Y_i - D_i'\alpha - X_i'\beta - \hat\Phi_i'\gamma\big) \hat V_i, \quad \text{where}$$

$\Phi_i = \Phi(X_i, Z_i)$, where $\Phi$ is a smooth $r \times 1$ vector function of instruments,
$\hat\Phi_i = \hat\Phi(X_i, Z_i)$, where $\hat\Phi$ is a smooth consistent estimate of $\Phi$, satisfying R5,
$V_i = V(X_i, Z_i) > 0$, where $V$ is a smooth weight function,
$\hat V_i = \hat V(X_i, Z_i) > 0$, where $\hat V$ is a smooth consistent estimate of $V$, satisfying R5.

Note that one may simply set $\hat\Phi_i = Z_i$ or $\hat V_i = 1$, which will give us the simpler versions
above. Efficient estimation is described in Corollary 1. We can use a wide variety
of nonparametric estimators and parametric approximations of $V$ and $\Phi$, satisfying a
standard smoothness condition, stated as a technical assumption R5 in appendix G.
We also assume $(\alpha_\tau, \beta_\tau)$ belongs to a compact set $\mathcal{A} \times \mathcal{B}$. Other technical conditions
are stated as assumptions R1-R5 in the appendix. Most of them are standard in the
quantile regression literature.

$^{10}$ Note, however, that EL has many good properties and purportedly performs well in finite samples.
A possible feasible approach is as follows. In the first stage, IQR estimates of the parameters could be
obtained. Then, in the second stage, the estimates could be recomputed using EL limiting the domain
to a neighborhood around the estimates obtained in the first stage.

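Now let's formally define the estimation procedure as follows:

$$\hat\alpha = \arg\inf_{\alpha} \hat\gamma[\alpha]' \hat A \hat\gamma[\alpha], \ \text{ such that} \tag{9}$$

$$(\hat\beta[\alpha], \hat\gamma[\alpha]) = \arg\inf_{(\beta, \gamma) \in \mathcal{B} \times \mathcal{G}} Q_n(\alpha, \beta, \gamma), \tag{10}$$

where $\mathcal{G} = [-\delta, \delta]^r$ for $\delta > 0$ and $\hat A \to_p A$ is a positive definite matrix. A final estimate
of $\beta_\tau$ is obtained as

$$\hat\beta = \arg\inf_{\beta} Q_n(\hat\alpha, \beta, 0). \tag{11}$$

Equations (9)-(10) are a finite-sample inverse or instrumental quantile regression
(IQR). (10) is the quantile regression step, and (9) is the "inverse" step.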

This formulation allows one to effectively reduce the dimensionality of a potentially
difficult optimization problem to the dimension of $\alpha$. In GMM, the objective function
is highly multi-modal and has zero derivative almost everywhere, implying the need to
perform a grid search over a subset of $\mathbb{R}^K$ where $K = \dim(x) + \dim(\alpha)$ (e.g. in Example
2 of the next section $\dim(x) + \dim(\alpha) = 16$). Such an estimator is infeasible, except
perhaps when $\dim(x) = 2$ or 3. In contrast, a simple implementation of inverse quantile
regression would require only a grid search over a subset of $\mathbb{R}^{\dim(\alpha)}$. The regression
quantile steps are solved as fast as OLS by interior point methods combined with
preprocessing, see Portnoy and Koenker (1997). The computations may be improved
further by employing parametric programming. In this approach the quantile regression
in (10) is initially solved for some $\alpha_0$, then one solves for $\hat\beta[\alpha]$ and $\hat\gamma[\alpha]$ for nearby $\alpha$
using a standard sensitivity analysis.
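
To make the dimensionality argument concrete, the following back-of-the-envelope count compares the number of grid evaluations in the two approaches. The resolution of 100 points per coordinate and the scalar treatment are assumptions chosen for illustration; the total dimension of 16 is the figure quoted above for Example 2.

```python
def grid_evaluations(points_per_axis, dim):
    """Number of objective evaluations for a full grid search in `dim` dimensions."""
    return points_per_axis ** dim

points = 100                              # assumed grid resolution per coordinate
gmm_evals = grid_evaluations(points, 16)  # joint search over all 16 parameters
iqr_evals = grid_evaluations(points, 1)   # search over alpha only, dim(alpha) = 1 assumed
```

Each of the 100 evaluations in the second case requires solving one conventional quantile regression, which remains a small cost next to the $10^{32}$ evaluations of a joint grid.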

We  now  turn  to  the  theoretical  properties  of  the  estimator.  In  the  appendix  we  also 
study  the  properties  of  the  generalized  empirical  likelihood  estimators. 

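Theorem 4 Under assumptions R1-R6 listed in the appendix

$$\sqrt{n}(\hat\alpha - \alpha_\tau) \to_d K \cdot N(0, S), \qquad \sqrt{n}(\hat\beta - \beta_\tau) \to_d L \cdot N(0, S),$$

where convergence is joint, and $N(0, S)$ is a normal vector with mean 0 and variance
$S = \tau(1 - \tau) E\Psi\Psi'$, where $K = (J_\alpha' H J_\alpha)^{-1} J_\alpha' H$, $H = \bar J_\gamma' A \bar J_\gamma$, $L = J_\beta^{-1}[I_k : 0]M$, $M =
I - J_\alpha K$, $\Psi = V \cdot [X' : \Phi']'$, $J_\alpha = E[f_\epsilon(0|X, D, Z)\Psi D']$ and $J_\beta = E[f_\epsilon(0|X, Z)XX']$,
where $\epsilon = Y - D'\alpha_\tau - X'\beta_\tau$. Finally, $J_\theta = E[1/V \cdot f_\epsilon(0|X, Z)\Psi\Psi']$, where $[\bar J_\beta' : \bar J_\gamma']'$ is
the partition of $J_\theta^{-1}$, such that $\bar J_\beta$ is a $k \times (k + r)$ matrix and $\bar J_\gamma$ is an $r \times (k + r)$ matrix.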

Corollary 1 Generally, when the number of instruments $\Phi$ equals that of endogenous
regressors $D$, the joint asymptotic variance of $\hat\alpha$ and $\hat\beta$ has a simple form

$$J^{-1} S (J^{-1})', \tag{12}$$

for $J = E[f_\epsilon(0|X, D, Z)\Psi[D' : X']]$. Choice of $A$ is irrelevant. Further, if $\Phi = \Phi^* =
E[D \cdot v|Z, X]/V^*$, where $v = f_\epsilon(0|D, Z, X)$, and $V = V^* = f_\epsilon(0|X, Z)$, the asymptotic
variance of $\hat\alpha$ and $\hat\beta$ simplifies to

$$\tau(1 - \tau) E[\Psi^* \Psi^{*\prime}]^{-1}, \quad \text{where } \Psi^* = V^* \cdot [X', \Phi^{*\prime}]'. \tag{13}$$

Corollary 2 If the number of instruments $\Phi$ is larger than the number of endogenous
regressors $D$, the choice of the weighting matrix $A$ matters. An optimal choice of $A$ is given by
$A = [J_\theta^{-1}]_{22}^{-1} = (\bar J_\gamma J_\theta \bar J_\gamma')^{-1}$, an $r \times r$ matrix. In this case the joint asymptotic variance
of $\hat\alpha$ and $\hat\beta$ equals $N S N'$, where $N = (J' J_\theta^{-1} J)^{-1} J' J_\theta^{-1}$. If in addition $V = V^*$, the
joint variance equals $(J' S^{-1} J)^{-1}$.

(13)  is  the  efficiency  bound  for  the  GMM  estimators  under  conditional  moment 
restrictions  as  in  Theorem  1.  This  is  the  efficiency  bound  in  the  sense  of  Amemiya 
(1977),  Chamberlain  (1987),  or  Newey  (1990).  See  also  Newey  and  Powell  (1990). 

Corollary  1  suggests  a  reasonable  approach  to  estimation  and  inference. 

First of all, in section 6, we used the simplest and most transparent strategy, projecting
$D$ on $Z$ with OLS to form the instrument $\hat\Phi$, and setting $\hat V_i = 1$. We used methods
described in Koenker (1994) to obtain the estimates of standard errors based on the
simple formula (12). The methods of Powell (1986) also apply without modification.
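
The simple strategy of forming the instrument by an OLS projection can be sketched as follows. The simulated first stage, the inclusion of an intercept, and the names `ols_fitted_values`, `phi_hat`, and `v_hat` are assumptions for illustration only.

```python
import numpy as np

def ols_fitted_values(target, regressors):
    """Fitted values from the least-squares regression of `target` on `regressors`."""
    coef, *_ = np.linalg.lstsq(regressors, target, rcond=None)
    return regressors @ coef

rng = np.random.default_rng(2)
n = 1_000
z = rng.normal(size=(n, 2))                         # hypothetical instruments
d = z @ np.array([1.0, -0.5]) + rng.normal(size=n)  # hypothetical first stage
w = np.column_stack([np.ones(n), z])                # intercept plus instruments

phi_hat = ols_fitted_values(d, w)  # plays the role of the estimated instrument
v_hat = np.ones(n)                 # unit weights, as in the simple strategy
```

The fitted values then replace $Z$ in the quantile regression step, and with unit weights the weighted objective reduces to the unweighted one.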

Generally, we can use many established methods to either approximate or implement
exactly the optimal procedure. We can estimate $f_\epsilon(0|\cdot)$ by the kernel methods
described in Andrews (1994) or quantile regression differencing as in Koenker (1994),
and $E[Dv|Z, X]$ can be estimated using series estimation (e.g. OLS of $Dv$ on $Z$, $X$
and their powers), as in Newey (1997) and Andrews and Whang (1990). Assumption
R5 allows for a wide variety of nonparametric and parametric estimation procedures;
Andrews (1994) discusses a number of them.

In practice, it is often reasonable to use parametric approximations, cf. Amemiya
(1975). For example, we may use conditional normality for $f_\epsilon(\cdot|\cdot)$ to get an approximation
of the standard errors and optimal weights in the quantile regressions above. On
the other hand, $E[Dv|Z, X]$ can be approximated by polynomial functions in $Z$, $X$ and
estimated by OLS. As long as the approximation of the optimal procedure is accurate, the
standard errors, based on (13) or on the more robust formula (12), will also be accurate.
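
A plug-in version of the sandwich formula (12) can be sketched with a uniform-kernel estimate of the density-weighted matrix $J$. This is one possible implementation, not necessarily the one used in the paper: the uniform kernel, the bandwidth value, the function name, and the simulated exogenous design used to check it are all assumptions.

```python
import numpy as np

def iqr_sandwich_variance(resid, psi, dx, tau, bandwidth):
    """Plug-in estimate of J^{-1} S (J^{-1})' with a uniform-kernel estimate of J.

    resid : n-vector of residuals Y - D'alpha_hat - X'beta_hat
    psi   : n x m matrix with rows V_i * [X_i', Phi_i']
    dx    : n x m matrix with rows [D_i', X_i']
    """
    n = resid.size
    kern = (np.abs(resid) <= bandwidth) / (2.0 * bandwidth)  # uniform kernel weights
    j_hat = (psi * kern[:, None]).T @ dx / n                 # estimate of E[f(0|.) Psi [D':X']]
    s_hat = tau * (1.0 - tau) * (psi.T @ psi) / n            # estimate of tau(1-tau) E[Psi Psi']
    j_inv = np.linalg.inv(j_hat)
    return j_inv @ s_hat @ j_inv.T

# Hypothetical check: exogenous D, intercept only, standard normal errors, tau = 0.5.
rng = np.random.default_rng(3)
n, tau = 100_000, 0.5
d = rng.normal(size=n)
eps = rng.normal(size=n)       # true residual, whose median is 0
x = np.ones((n, 1))
psi = np.column_stack([x, d])  # [X', Phi'] with Phi = D and V = 1
dx = np.column_stack([d, x])   # [D', X']
avar = iqr_sandwich_variance(eps, psi, dx, tau, bandwidth=0.2)
```

In this check the target is $\tau(1 - \tau)/f_\epsilon(0)^2 = \pi/2$ on the diagonal and 0 off the diagonal, so the estimate should be close to $1.57$ times the identity matrix, up to kernel smoothing and sampling error.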

When  there  is  a  compelling  reason  to  use  instruments  $  of  dimension  larger  than 
that  of  D,  Corollary  2  describes  the  choice  of  the  weighting  matrix  A  that  simplifies 
the  asymptotic  variance. 

The documented computer programs in the programming languages R (free software
available from www.r-project.org) and Matlab that implement the estimation and inference
are available from the authors. The programs implement both the optimal and
sub-optimal instrument cases.

6     Empirical Applications

This  section  presents  the  empirical  illustration  to  the  economic  models  presented  in 
section  3.  The  first  example  is  a  market  demand  model,  and  the  second  example  is  an 
evaluation  of  a  job  training  program. 

6.1      Demand  for  Fish 

In this section, we present estimates of demand elasticities which may potentially vary
with the level of demand, $\tau$. The data contain observations on price and quantity of
fresh whiting sold in the Fulton fish market in New York over the five month period
from December 2, 1991 to May 8, 1992. These data were used previously in Graddy
(1995)  to  test  for  imperfect  competition  in  the  market  and  later  in  Angrist,  Graddy, 
and  Imbens  (2000)  to  illustrate  use  of  the  conventional  IV  estimator  as  a  weighted 
average  of  heterogeneous  demands.  The  price  and  quantity  data  are  aggregated  by 
day,  with  the  price  measured  as  the  average  daily  price  for  the  dealer  and  the  quantity 
as  the  total  amount  of  fish  sold  that  day.  The  data  also  contain  information  on  the 
day  of  the  week  of  each  observation  and  variables  indicating  weather  conditions  at  sea, 
which  are  used  as  instruments  to  identify  the  demand  equation.  The  total  sample 
consists  of  111  observations  for  the  days  in  which  the  market  was  open. 
The demand function we estimate takes a standard Cobb-Douglas form:

$$Q_{\ln(Y_p)|x}(\tau) = \alpha_\tau \ln p + X'\beta_\tau,$$

where $Y_p$ is demand when price is $p$. The elasticity $\alpha_\tau$ varies across the quantiles
$\tau$ of the demand level. Following the discussion in section 3, this is a demand model with
non-separable error and random elasticity.
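
Because quantiles are equivariant to increasing transformations, the log-linear specification can be restated in levels. The restatement below is a standard consequence of that property rather than an additional assumption, and it shows why $\alpha_\tau$ is the price elasticity of the $\tau$-th quantile demand curve:

$$Q_{Y_p|x}(\tau) = \exp\big(X'\beta_\tau\big)\, p^{\alpha_\tau}, \qquad \frac{\partial \ln Q_{Y_p|x}(\tau)}{\partial \ln p} = \alpha_\tau.$$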

The  top  two  panels  of  Figure  1  provide  the  estimates  of  elasticities  obtained  by 
IQR  of  ln(Y)  on  ln(P)  using  wind  speed  as  the  instrumental  variable,  while  the  lower 
panels  depict  standard  quantile  regression  (QR)  estimates.  The  shaded  region  around 
the  point  estimates  represents  the  80  percent  confidence  interval.  While  the  reported 
estimates  are  for  a  model  without  covariates,  the  estimated  elasticities  are  not  sensitive 
to  the  inclusion  of  dummy  variables  for  the  days  of  the  week  or  other  covariates. 

The  price  effect  on  quantities  sold,  as  estimated  by  QR,  appears  to  be  approximately 
constant  across  the  entire  range  of  quantiles.  The  magnitudes  of  the  effects  are  also 
quite small, in all cases much less than unity. IQR estimates, on the other hand,
range from -2 to -0.5, with a median elasticity of -1, indicating variation of elasticities
with  the  level  of  demand.  Except  at  high  quantiles,  the  IQR  elasticities  are  uniformly 
greater  in  magnitude  than  the  price  effects  predicted  by  QR.  This  is  clearly  shown 
in  the  demand  curves  plotted  in  Figure  2.  Note  that  the  interpretation  of  IQR  and 
QR  estimates  is  very  different.  IQR  estimates  a  (causal)  demand  model,  while  QR 
estimates  the  conditional  quantiles  of  the  equilibrium  quantity  as  a  function  of  the 
equilibrium  price. 

The IQR estimates of the demand elasticities $\alpha_\tau$ illustrate heterogeneity across the
demand levels. The results indicate that demand elasticity is quite high in magnitude
at low quantiles, but is decreasing in the quantile index. While there are many possible
explanations for this demand behavior, it does cast doubt on the hypothesis that the
aggregate demand in this market is a sum of the demand curves of numerous identical
price-taking agents who randomly arrive at the market. The estimates may also suggest
that a single statistic may be insufficient to truly capture the demand function variety.

6.2     Evaluation  of  a  JTPA  Program 

The  impact  of  job  training  programs  on  the  earnings  of  participants,  especially  those 
with  low  income,  is  of  great  interest  to  economists,  but  evaluating  the  causal  effect  of 
training  programs  on  earnings  is  difficult  due  to  the  self-selection  of  treatment  status. 
However,  data  available  from  a  randomized  training  experiment  conducted  under  the 
Job  Training  Partnership  Act  (JTPA)  provides  a  mechanism  for  addressing  this  issue. 
In  the  experiment,  people  were  randomly  assigned  the  offer  of  JTPA  training  services, 
but  because  people  were  able  to  refuse  to  participate,  the  actual  treatment  receipt 
was  self-selected.  Of  those  offered  treatment,  only  60  percent  participated  in  the 
training.  There  was  also  a  small  number  of  individuals  from  the  control  group  who 
received  training.  The  random  assignment  of  the  training  offer  provides  a  plausible 
instrument for a person's actual training status. Abadie et al. (2000) and Heckman
and  Smith  (1997)  provide  detailed  information  regarding  data  collection  procedures 
and  institutional  details  of  the  JTPA.  We  limit  the  analysis  to  the  adult  males. 
To  capture  the  effects  of  training  on  earnings,  we  estimate  a  linear  model: 

$$Q_{Y_d|x}(\tau) = d\alpha_\tau + X'\beta_\tau,$$

where $d$ indicates training status and is instrumented for by assignment to the control
group, the potential outcomes $Y_d$ are earnings, and $X$ is a vector of covariates. The
data  consist  of  5,102  observations  with  data  on  earnings,  training  and  assignment 
status,  and  other  individual  characteristics.  Earnings  are  measured  as  total  earnings 
over  the  30  month  period  following  the  assignment  into  the  treatment  or  control  group. 
We  also  include  dummies  for  black  and  Hispanic  persons,  a  dummy  indicating  high- 
school  graduates  and  GED  holders,  five  age-group  dummies,  a  marital  status  dummy, 
a  dummy  indicating  whether  the  applicant  worked  12  or  more  weeks  in  the  12  months 
prior  to  the  assignment,  a  dummy  signifying  that  earnings  data  are  from  a  second 
follow-up survey, and dummies for the recommended service strategy.$^{11}$

Results for standard quantile regression are illustrated in Figure 4 and IQR estimates
in Figure 3. The shaded region represents the 90 percent confidence interval
for the point estimates. The first panel in each figure shows the estimated impact of
participation in the training program across various quantiles. A quick comparison
of the two sets of results shows that the standard quantile regression estimates of
the statistical impacts of training are well above the treatment effect. The quantile
regression estimates are uniformly larger than the IQR estimates, and in many cases
the difference is quite substantial. This difference is perhaps most important in the
low to middle quantiles where the conventional quantile regression estimates indicate
a relatively large statistical impact of training on the earnings of participants.

$^{11}$ The recommended service strategy was broken into three categories: classroom training, on-the-job
training and/or job search assistance, and other forms of training.

The differences in the standard QR and the IQR estimates, as well as the distributional
impacts of the program, are made even more apparent when one considers
the impact of training in percentage terms.$^{12}$ Quantile regression estimates indicate
large  percentage  impacts,  especially  in  the  lower  quantiles.  The  IQR  estimates,  on 
the  other  hand,  indicate  that  the  percentage  causal  impact  of  the  training  program  is 
relatively  constant  and  low,  between  5  and  10  percent,  along  the  whole  distribution. 
This  is  interesting  since  the  supposed  intent  of  job  training  programs  is  to  raise  the 
incomes  of  low  income  individuals.  However,  we  observe  that  the  impacts  were  actually 
the  greatest  for  the  upper  quantiles. 

Coefficient  estimates  for  several  of  the  covariates  are  also  included  in  the  figures. 
None  of  the  results  are  particularly  surprising.  Being  Hispanic  has  no  significant  impact 
on  potential  earnings  at  any  point  in  the  distribution,  while  at  medium  and  high 
quantiles,  blacks  earn  significantly  less  than  whites.  We  also  see  that  education,  as 
measured  by  high  school  graduation  or  having  a  GED,  has  a  positive  impact  along 
almost  the  entire  distribution,  with  the  impact  growing  monotonically  in  the  quantile 
index.  This  pattern  is  also  observed  for  the  marriage  effect,  which  tapers  off  in  the 
highest  quantiles.  The  effect  of  having  worked  little  in  the  previous  year  runs  in 
almost  exactly  the  opposite  direction,  impacting  earnings  negatively  at  all  quantiles 
and  decreasing  earnings  substantially  in  the  upper  tail  of  the  distribution. 

We next compare our results with those in Abadie et al. (2001). Since identification
in the two models comes through different assumptions and the estimated treatment effects
are for different populations (the model of Abadie et al. is for the sub-population of
LATE-compliers), the estimation results need not agree. However, the JTPA is an example
where both sets of assumptions appear to hold. Independence and monotonicity
are almost certainly satisfied, and it seems reasonable that, relative to others with similar
characteristics, the similarity assumption is also fulfilled.$^{13}$ Under these conditions,
the models overlap and the results should indeed be comparable if the subpopulation
of LATE-compliers is representative of the entire population. This appears to be the
case in the present example.

Lastly,  consider  the  results  of  Heckman  and  Smith  (1997).  The  model  of  Heckman 
and  Smith  (1997)  did  not  incorporate  endogeneity  (it  had  a  different  point).  Thus  their 
results  correspond  to  our  QR  results  (fig  4),  and  differ  from  the  IQR  results  (fig  3). 

6.3     Numerical  Performance 

The objective functions for selected quantiles from Examples 1 and 2 are graphed
in Figure 5. The upper three panels in the figure illustrate the objective functions
from the fish example, while the lower panels correspond to the JTPA example. The
objective functions are very well-behaved, especially in the JTPA example. Each of
the objective functions from the fish example does have many local minima, which is
attributable to the small sample size. However, in all cases, the functions have an
obvious unique global minimum.

$^{12}$ The percentage impact of training is calculated for both whites (Percentage Impact I) and blacks
(Percentage Impact II). Percentages are calculated for married high-school graduates aged 30 to 35.

$^{13}$ Note that conditioning on covariates weakens the required similarity condition, requiring that
similarity only hold for people with the same covariate values.

7     Conclusion  and  Future  Research 

This paper offered two contributions. First, it proposed a model of quantile treatment
effects which allows for treatment endogeneity. The model exploits similarity as a
main identification restriction. The resulting model differs from both Heckman's non-parametric
selection model and Imbens and Angrist's LATE model. From this model
we derive a Wald IV estimating equation. We show that the model does not require
functional form assumptions for identification. Second, we characterized the quantile
treatment function as solving an inverse quantile regression problem and suggested
its finite-sample analog as a practical estimator. This estimator, unlike generalized
method-of-moments, can be easily computed by solving a series of conventional quantile
regressions, and does not require grid searches over high-dimensional parameter
sets. A properly weighted version of it is also efficient. We applied this estimator to
characterize quantile treatment effects in a market demand model and an evaluation of a
job training program.

An important feature of the proposed model is that even though one may not be
interested in quantile treatment effects, one may still have to estimate them. Indeed,
the average treatment effects cannot be estimated by conventional IV methods, as
shown in example 5.1.$^{14}$ Instead, quantile treatment effects have to be estimated first
and then integrated over the quantile index. Alternatively, one may estimate only the
median treatment effects, using the proposed model and estimator.
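
The integration step can be sketched as follows. If $Y_d = q(d, U_d)$ with $U_d$ uniform on $(0, 1)$, then $E[Y_d] = \int_0^1 q(d, u)\,du$, so the average effect is the integral of the quantile treatment effect over the quantile index. The QTE curve $1 + \tau$ used below is hypothetical, as are the function name and the midpoint rule; with estimated effects on an interior grid of $\tau$, some treatment of the tails would also be needed.

```python
import numpy as np

def average_effect_from_qte(qte, n_points=1000):
    """Midpoint-rule approximation of the integral of qte(tau) over tau in (0, 1)."""
    taus = (np.arange(n_points) + 0.5) / n_points
    return float(np.mean(qte(taus)))

# Hypothetical QTE curve, used only for illustration: delta(tau) = 1 + tau.
ate = average_effect_from_qte(lambda t: 1.0 + t)
```

For this linear curve the midpoint rule is exact, and the implied average effect is 1.5.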

In companion works, we consider a number of directions. In a joint work with Whitney
Newey and Guido Imbens, we explore fully non-parametric estimation, which poses
an interesting problem. Other research directions are also considered. For example,
an  important  research  question  is  how  to  estimate  policy-relevant  treatment  effects  in 
an  expected  utility  framework,  given  particular  social  loss  functions,  known  program 
costs,  and  effects  on  choice  probabilities  (cf.  Heckman  et  al  (2001)). 


$^{14}$ Note that the local average treatment effect may still be identified without estimating the QTE.

Figure 1: Inverse Quantile Regression and Quantile Regression Results for fish data (panels:
A. IQR: Treatment Effect; B. QR: Price Effect). The
quantile treatment effect, estimated by IQR, is the elasticity of the $\tau$-th quantile demand
curve. It tends to be much higher than the "price effect" on the $\tau$-quantiles of quantities sold,
estimated by QR.

Figure 2: Left Column: The demand curves estimated by IQR, indexed by the quantile index
(.1, .2, .3, .5, .7, .8, and .9). The top display is in log-price-log-quantity space. The bottom
display is in the original space. Right Column: The estimated conditional quantile curves of
fish quantity sold as a function of price. The top display is in log-price-log-quantity space.
The bottom display is in the original space.

Figure 3: Inverse Quantile Regression results on JTPA data (panel title: IQR: Treatment Effect).

Figure 4: Quantile Regression results on JTPA data (panel title: QR: Training).

Figure 5: A. IQR objective functions for the fish example. B. IQR objective functions for
the JTPA example.


A     Definitions  and  Lemmas 

We use the following empirical processes in the sequel, for $W = (Y, D, X, Z)$:

$$f \mapsto E_n f(W) = \frac{1}{n} \sum_{i=1}^n f(W_i), \qquad f \mapsto G_n f(W) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \big(f(W_i) - E f(W_i)\big).$$

For example, if $\hat f$ is an estimated function, $G_n \hat f(W)$ means $\frac{1}{\sqrt{n}} \sum_{i=1}^n (f(W_i) - E f(W_i))_{f = \hat f}$.
Outer and inner probabilities, $P^*$ and $P_*$, are defined as in van der Vaart (1998). In this paper
$\to_p$ means convergence in (outer) probability, and $\to_d$ means convergence in distribution. We
will say that the process $\{l \mapsto v_n(l), l \in \mathcal{L}\}$ is stochastically equicontinuous (s.e.) in $\ell^\infty(\mathcal{L})$ if for
each $\epsilon > 0$ and $\eta > 0$, there is $\delta > 0$:

$$\limsup_{n \to \infty} P^*\Big(\sup_{\rho(l, l') < \delta} |v_n(l) - v_n(l')| > \eta\Big) < \epsilon$$

for some pseudo-metric $\rho$ on $\mathcal{L}$, such that $(\mathcal{L}, \rho)$ is a totally bounded pseudo-metric space.

The following results are from Knight (1999). They allow general discontinuities and
$\bar{\mathbb{R}}$-valued objective functions. Related literature is Rockafellar and Wets (1998).

Lemma A.1 (Geyer's Lemma) Suppose $\{Q_n\}$ is a sequence of lower-semi-continuous convex
$\bar{\mathbb{R}}$-valued random functions, defined on $\mathbb{R}^d$, and let $\mathcal{D}$ be a countable dense subset of $\mathbb{R}^d$.
If $Q_n$ converges to $Q_\infty$ in $\bar{\mathbb{R}}$ on $\mathcal{D}$, in finite dimensional sense, where $Q_\infty$ is lsc convex and
finite on an open non-empty set a.s., then

$$\arg\inf_{z \in \mathbb{R}^d} Q_n(z) \to_d \arg\inf_{z \in \mathbb{R}^d} Q_\infty(z),$$

provided the latter is uniquely defined a.s. in $\mathbb{R}^d$.

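Lemma A.2 (Approximate Argmins) Suppose

i. $Z_n$ is s.t. $Q_n(Z_n) \le \inf_{z \in \mathbb{R}^d} Q_n(z) + \epsilon_n$, $\epsilon_n \searrow 0$; $Z_n = O_p(1)$.

ii. $Z_\infty = \operatorname{argmin}_{z \in \mathbb{R}^d} Q_\infty(z)$ is uniquely defined in $\mathbb{R}^d$ a.s.

iii. $Q_n(\cdot) \Rightarrow Q_\infty(\cdot)$ in $\ell^\infty(K)$ over any compacts $K$, where $Q_\infty$ is continuous.

Then $Z_n \to_d Z_\infty$.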

B     Two-Stage Quantile Regression: Inconsistency when
QTE varies with $\tau$

The  model  proposed  by  Amemiya  consists  of  two  equations 

$$\text{(i)} \quad Y = D'\delta + X'\beta + U,$$
$$\text{(ii)} \quad D = Z'\gamma + V, \tag{14}$$

where $D$ is an endogenous vector, i.e. $D$ depends on the real-valued $U$, $X$ is a vector of
exogenous or predetermined variables, $U$ and $V$ are independent of $X$ and $Z$, and $U$ and $V$
are jointly symmetric and absolutely continuous.

Parameters $(\beta, \delta)$ can be estimated by two-stage LAD, Amemiya (1982), by projecting
$D$ on $Z$ to get $\hat\gamma$, and using the median regression of $Y$ on $(Z'\hat\gamma, X)$. Another valid second
stage is the quantile regression, cf. Chen and Portnoy (1996), known as two-stage quantile
regression (2SQR).

Model  (14)  imposes  the  constant  QTE.  The  treatment  variable,  if  assigned  externally,  shifts 
the  location  of  the  outcome  variable,  but  does  not  affect  the  scale  or  shape  of  its  distribution. 
This  severely  limits  the  treatment  variety. 

The assumption of constant QTE is crucial for the validity of 2SQR. If QTEs are
non-constant, 2SQR does not consistently estimate them. Unfortunately a rather extensive
empirical literature has used 2SQR to estimate non-constant QTEs.

To explain the inconsistency, it suffices to consider an example with no endogeneity. Suppose
for some increasing one-to-one map $\delta(\cdot)$:

$$Y = D\delta(U), \quad U \sim U(0, 1),$$
$$D = Z'\gamma + V, \tag{15}$$
$$Z, U, V \text{ are mutually independent.}$$

Assume that $U$, $V$ have densities conditional on $Z$ and that $D > 0$. To pin down $\gamma$ we may
assume $E[V] = 0$. It is sufficient to show that $\delta(\tau)$ is generally not the optimum in the
population 2SQR problem. That is, there is no $\alpha$ such that

i. $E\big(1(Y < \alpha + \delta(\tau)Z'\gamma) - \tau\big) = 0$,
ii. $E\big(1(Y < \alpha + \delta(\tau)Z'\gamma) - \tau\big)Z'\gamma = 0$.

By definition $\{Y < a + \delta(\tau) Z'\gamma\} = \{V\delta(U) + Z'\gamma(\delta(U) - \delta(\tau)) < a\}$. Equation i. implies:

\[ a = Q_M(\tau), \quad \text{where } M = V\delta(U) + Z'\gamma \cdot (\delta(U) - \delta(\tau)), \]

thus it remains to check whether

\[ E\big(1(M < Q_M(\tau)) - \tau\big) \cdot Z'\gamma = 0. \tag{17} \]

Generally (17) is false. Equation (17) holds when $M$ is $\tau$-quantile independent of $Z$:

\[ Q_{M|Z}(\tau) = Q_M(\tau) \ \ P\text{-a.e.} \iff P[M < Q_M(\tau) \mid Z] = \tau \ \ P\text{-a.e.} \]

This necessarily happens when $\delta(\tau) = \delta$, the constant treatment effect case, or, for example, when $\tau = 1/2$ and $M$ is symmetric given $Z$, as in Amemiya (1982).
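
The constant-effect case can be worked out in one line. The following is a short sketch under the assumptions of example (15) only, writing $\bar\delta$ for the hypothetical constant value of $\delta(\cdot)$:

```latex
% Constant treatment effect: \delta(U) = \delta(\tau) = \bar\delta.
\begin{align*}
M &= V\bar\delta + Z'\gamma\,(\bar\delta - \bar\delta) = \bar\delta\, V, \\
E\big[(1(M < Q_M(\tau)) - \tau)\, Z'\gamma\big]
  &= E\big[1(\bar\delta V < Q_M(\tau)) - \tau\big] \cdot E[Z'\gamma] = 0,
\end{align*}
% using independence of V and Z, and P[\bar\delta V < Q_M(\tau)] = \tau
% because V has a density.
```

When $\delta(\cdot)$ is not constant, $M$ retains the term $Z'\gamma(\delta(U) - \delta(\tau))$, so its conditional distribution given $Z$ generally depends on $Z$ and the factorization above is no longer available.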

Simple examples suffice to confirm that (17) indeed fails. The first example involves no endogeneity:

•  $V = 5 + N(0,1)$, truncated to be positive,

•  $Z'\gamma = 5 + N(0,1)$, truncated to be positive,

•  $\delta(U) = N(0,1)/100$, $D = Z'\gamma + V$, $Y = D \cdot \delta(U)$.

The following computation uses Monte Carlo integration with 500,000 simulations.

•  $E\big(1(M < Q_M(\tau)) - \tau\big) Z'\gamma = 0.34$, for $\tau = .7$, with s.e. of .003.

The second example involves endogeneity:

•  $V = 5 + N(0,1)$, truncated to be positive,

•  $Z'\gamma = 5 + N(0,1)$, truncated to be positive,

•  $\delta(U) = V/100$, $D = Z'\gamma + V$, $Y = D \cdot \delta(U)$.

The following computation uses Monte Carlo integration with 500,000 simulations.

•  $E\big(1(M < Q_M(\tau)) - \tau\big) Z'\gamma = 0.38$, for $\tau = .7$, with s.e. .003.
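
A Monte Carlo check of moment (17) could be coded along the following lines. This is only a sketch: the helper names (`positive_shifted_normal`, `moment_17`), the seed, and the sample size are assumptions made for illustration, and the code is exercised only on the constant-effect benchmark, where (17) should hold up to simulation error. It does not attempt to reproduce the figures reported above, since the exact designs are not fully recoverable from the text.

```python
import numpy as np

def positive_shifted_normal(rng, n, shift=5.0):
    """Draw shift + N(0, 1), redrawing any non-positive values (truncation)."""
    x = shift + rng.standard_normal(n)
    bad = x <= 0
    while bad.any():
        x[bad] = shift + rng.standard_normal(int(bad.sum()))
        bad = x <= 0
    return x

def moment_17(delta_u, delta_tau, v, zg, tau):
    """Sample analog of E[(1(M < Q_M(tau)) - tau) * Z'gamma], with
    M = V*delta(U) + Z'gamma*(delta(U) - delta(tau))."""
    m = v * delta_u + zg * (delta_u - delta_tau)
    q = np.quantile(m, tau)          # empirical tau-quantile of M
    return np.mean(((m < q) - tau) * zg)

rng = np.random.default_rng(0)       # hypothetical seed
n, tau = 200_000, 0.7
v = positive_shifted_normal(rng, n)
zg = positive_shifted_normal(rng, n)

# Constant-effect benchmark: delta(U) = delta(tau) = 0.01, so M = 0.01 * V
# is independent of Z and the moment should be close to zero.
moment_const = moment_17(np.full(n, 0.01), 0.01, v, zg, tau)
print(f"constant-effect moment: {moment_const:.4f}")
```

To study a non-constant design, one would pass draws of $\delta(U)$ and the corresponding value $\delta(\tau)$ to `moment_17` in place of the constants.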

C      Comparison with Abadie et al.'s Model

In  Abadie,  Angrist,  and  Imbens  (2001),  henceforth  AAI,  the  treatment  variable  D  and  the 
instrument  Z  are  both  binary.  The  binary  nature  of  D  is  critical,  and  extensions  to  the 
general  case  are  not  known.  The  general,  non-binary  case  is  clearly  important.  This  approach, 
however,  is  well  suited  to  many  experimental  studies. 

The potential outcomes $Y_d$ are indexed by the treatment status $d \in \{0,1\}$, and the potential treatments $D_z$ are indexed by the instrument status $z \in \{0,1\}$. The realized outcome is $Y = Y_D$, while the realized treatment is $D = D_Z$. AAI impose the independence condition

\[ (Y_0, Y_1, D_1, D_0) \text{ are independent of } Z, \tag{18} \]

and the monotonicity assumption:

\[ D_1 \ge D_0 \ \text{a.s.} \tag{19} \]

For example, let

\[ D_z = 1(\varphi(z) > V), \quad \text{where } Z \text{ is independent of } V, \tag{20} \]

i.e. $D_0 = 1(\varphi(0) > V)$ and $D_1 = 1(\varphi(1) > V)$. $V$ may depend on the potential outcomes $Y_0$ and $Y_1$. Model (20) along with (18) satisfies the independence and monotonicity assumptions. (Vytlacil (2001) shows that the converse is also true, in the sense of distribution equivalence.) Exploiting that $D$ and $Z$ are binary, independence, and monotonicity, AAI show that in the subpopulation of compliers, where $D_1 > D_0$, the realized treatment $D$ is independent of potential outcomes:

\[ (Y_1, Y_0) \text{ are independent of } D \mid X, \ D_1 > D_0. \]

The compliers are manipulated by the instrument and, therefore, randomly receive a treatment status. That is, the treatment status is given to them independently of their potential responses $Y_0$ and $Y_1$, conditional on observed covariates $X$, so endogeneity is removed in this subpopulation.

Let $Q_{Y|X,C}(\tau)$ denote the $\tau$-quantile for the population of compliers conditional on $(X, D_1 > D_0)$. The quantile treatment effect $\delta(\tau)$ is a difference in the conditional $\tau$-quantiles of $Y_1$ and $Y_0$ for compliers:

\[ Q_{Y|X,C}(\tau) = \delta(\tau) D + X'\beta(\tau). \]

AAI  suggest  an  ingenious  weighting  scheme  that  "finds  compliers"  (compliers  are  unobserved) 
and  interpret  their  estimator  as  a  re-weighted  Koenker  and  Bassett's  quantile  regression. 
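
To illustrate how a weighting scheme can "find compliers", the sketch below uses the weight usually associated with this approach, $\kappa = 1 - D(1-Z)/(1-\pi) - (1-D)Z/\pi$ with $\pi = P(Z=1 \mid X)$. This formula is not stated in this appendix and is included as an assumption; the simulated population shares and the function name `kappa_weights` are hypothetical. The mean of $\kappa$ should recover the complier share even though compliers are not individually observed.

```python
import numpy as np

def kappa_weights(d, z, pz):
    """Complier-type weights for binary d and z; pz = P(Z = 1 | X).
    Assumed form: 1 - d(1 - z)/(1 - pz) - (1 - d) z / pz."""
    return 1.0 - d * (1 - z) / (1.0 - pz) - (1 - d) * z / pz

rng = np.random.default_rng(1)              # hypothetical seed
n = 200_000
z = rng.integers(0, 2, n)                   # randomized instrument, P(Z=1) = 0.5
# Latent types: 0 = never-taker, 1 = always-taker, 2 = complier (D = Z).
types = rng.choice(3, size=n, p=[0.2, 0.3, 0.5])
d = np.where(types == 1, 1, np.where(types == 2, z, 0))

kappa = kappa_weights(d, z, 0.5)
complier_share = kappa.mean()               # estimates P(D_1 > D_0)
print(f"estimated complier share: {complier_share:.3f}")
```

Individual weights can be negative, so in practice they act as a device for averaging over compliers rather than as sampling weights.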

The  main  differences  with  our  approach  are  the  following. 

First, our model's QTE is defined relative to the population, while AAI's QTE is defined relative to compliers. The compliers may substantially differ from the entire population. For example, in Angrist and Krueger (1992), the compliers are those whose education level is affected by their birthdate. Thus, the 90% QTE in AAI's model may differ substantially from the 90% QTE in our model. The QTEs of the two models may coincide if compliers in AAI's model are representative of the population and other assumptions overlap as well.

Second, AAI's model applies to binary cases only, while the present approach applies to general cases. Third, the estimation procedures are fundamentally different. Fourth, we use similarity or rank invariance to identify QTE while AAI use the monotonicity and stronger independence conditions.

D     Proof  of  Theorem  1 

Part (a). Conditioning on $X = x$ is suppressed. For P-a.e. value $z$ of $Z$

\begin{align*}
P[Y \le q[D,\tau] \mid Z = z]
&\overset{(1)}{=} P[q[D,U_D] \le q[D,\tau] \mid Z = z] \\
&\overset{(2)}{=} P[U_D \le \tau \mid Z = z] \\
&\overset{(3)}{=} \int P[U_D \le \tau \mid Z = z, V = v]\, dP[V = v \mid Z = z] \\
&\overset{(4)}{=} \int P[U_{\delta(z,v)} \le \tau \mid Z = z, V = v]\, dP[V = v \mid Z = z] \\
&\overset{(5)}{=} \int P[U_0 \le \tau \mid Z = z, V = v]\, dP[V = v \mid Z = z] \\
&\overset{(6)}{=} P[U_0 \le \tau \mid Z = z] \\
&\overset{(7)}{=} \tau.
\end{align*}

Equality (1) is by A1 and A5. Equality (3) is by definition. Equality (4) is by A2. Equality (5) is by the similarity assumption A4: for each $d$, conditional on $(V = v, X = x, Z = z)$,

\[ U_{\delta(z,v)} \text{ equals in distribution } U_0. \]

Equality (6) is by definition and equality (7) is by A3. Note that equality (2) is immediate when $\tau \mapsto q(d,\tau)$ is continuous, since we assumed that $\tau \mapsto q(d,\tau)$ is strictly increasing. To show that (2) holds more generally, simply note that for $\tau \in (0,1)$ the event $\{U_D \le \tau\}$ implies the event $\{q[D,U_D] \le q[D,\tau]\}$ by $\tau \mapsto q[d,\tau]$ non-decreasing on $(0,1)$ for each $d$. On the other hand, the event $\{q[D,U_D] \le q[D,\tau]\}$ implies the event $\{U_D \le \tau\}$, since $\tau \mapsto q[d,\tau]$ is strictly increasing and left-continuous^15 in $(0,1)$ for each $d$.

Finally, since $\tau \mapsto q[d,\tau]$ is strictly increasing and left-continuous, we have

\[ P[q[D,U_D] = q[D,\tau] \mid Z = z] = 0, \]

so that P-a.e.

\[ P[Y \le q[D,\tau] \mid Z] = P[Y < q[D,\tau] \mid Z]. \]

^15 $\tau \mapsto q[d,\tau]$ is said to be left-continuous if $\lim_{\tau' \uparrow \tau} q[d,\tau'] = q[d,\tau]$.

Part (b). Conditioning on $X = x$ is suppressed. For P-a.e. value $z$ of $Z$

\begin{align*}
P[Y \le q[D,\tau] \mid Z = z]
&\overset{(1)}{=} P[q[D,U_D] \le q[D,\tau] \mid Z = z] \\
&\overset{(2)}{\ge} P[U_D \le \tau \mid Z = z] \tag{22} \\
&\overset{(3)}{=} P[U_0 \le \tau \mid Z = z] = \tau,
\end{align*}

where the equalities (1) and (3) are by the same arguments as in the proof of part (a), and inequality (2) follows because the event $\{U_D \le \tau\}$ is a subset of the event $\{q[D,U_D] \le q[D,\tau]\}$ by $\tau \mapsto q[d,\tau]$ non-decreasing for each $d$. On the other hand,

\begin{align*}
P[Y < q[D,\tau] \mid Z = z]
&\overset{(4)}{=} P[q[D,U_D] < q[D,\tau] \mid Z = z] \tag{23} \\
&\overset{(5)}{\le} P[U_D < \tau \mid Z = z] \\
&\overset{(6)}{=} P[U_0 < \tau \mid Z = z] = \tau,
\end{align*}

where the equalities (4) and (6) are by the same arguments as in the proof of part (a). (5) holds as an equality if $q[D,\tau]$ is strictly increasing at $\tau$, P-a.e., conditional on $Z = z$, since the event $\{q[D,U_D] < q[D,\tau]\}$ then equals the event $\{U_D < \tau\}$ P-a.e. If, on the other hand, $q[D,\tau]$ is flat at $\tau$ with probability $> 0$, conditional on $Z = z$, then $\{q[D,U_D] < q[D,\tau]\}$ is a strict subset of the event $\{U_D < \tau\}$ by $\tau \mapsto q[d,\tau]$ non-decreasing and left-continuous for each $d$. ■
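
The conclusion of part (a) can be checked numerically in a simple design. The sketch below is an illustration under stated assumptions, not the paper's design: it uses rank invariance ($U_d = U$ for all $d$, a special case of similarity), a binary instrument, and a hypothetical structural function `q`. Selection into treatment depends on $U$, so treatment is endogenous.

```python
import numpy as np

rng = np.random.default_rng(2)              # hypothetical seed
n, tau = 200_000, 0.5
u = rng.uniform(size=n)                     # rank variable U, shared across d
z = rng.integers(0, 2, n)                   # binary instrument, independent of U
eps = rng.standard_normal(n)
# Treatment choice depends on U, so D is endogenous.
d = (0.5 * z + u + 0.3 * eps > 0.9).astype(int)

def q(d, t):
    """Hypothetical structural quantile function, strictly increasing in t."""
    return d * (1.0 + t) + 2.0 * t

y = q(d, u)                                 # observed outcome Y = q(D, U_D)
hit = y <= q(d, tau)                        # event {Y <= q(D, tau)}

# Conditional on the instrument, the event has probability close to tau ...
p_given_z = [hit[z == k].mean() for k in (0, 1)]
# ... but conditional on the endogenous treatment it does not.
p_given_d = [hit[d == k].mean() for k in (0, 1)]
print("P[Y <= q(D,tau) | Z=z]:", np.round(p_given_z, 3))
print("P[Y <= q(D,tau) | D=d]:", np.round(p_given_d, 3))
```

The contrast between the two conditional probabilities is the reason conventional quantile regression of $Y$ on $D$ does not recover $q$ here, while the instrument-based restriction does.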


E     Proof  of  Theorem  2 

First show that statement (1) $\iff$ statement (2). Let $\epsilon = Y - q(D,X,\tau)$. This follows immediately by definition $Q_{\epsilon|X,Z}(\tau) = \inf\{m : P[\epsilon \le m \mid X, Z] \ge \tau\}$.

We next show that statement (2) $\iff$ statement (3). $0 = Q_\epsilon[\tau \mid x,z]$ is the conditional quantile. Therefore, it is a best predictor under asymmetric absolute loss, cf. Manski (1985), p. 55. We need to show a stronger fact, (2) $\iff$ (3), extending his argument. Write for any $v < 0$

\begin{align*}
& E[\rho_\tau(\epsilon - v) \mid x,z] - E[\rho_\tau(\epsilon) \mid x,z] \\
&= (1-\tau)\int_{(-\infty,v)} v \, dF_\epsilon[e \mid x,z] + \int_{[v,0)} [e - \tau v] \, dF_\epsilon[e \mid x,z] + \tau \int_{[0,\infty)} (-v) \, dF_\epsilon[e \mid x,z] \tag{24} \\
&= (1-\tau)\, v\, P[\epsilon < v \mid x,z] + (1-\tau)\, v\, P[\epsilon \in [v,0) \mid x,z] - \tau v P[\epsilon \ge 0 \mid x,z] + \int_{[v,0)} (e - v)\, dF_\epsilon[e \mid x,z] \\
&= v\big((1-\tau) P[\epsilon < 0 \mid x,z] - \tau P[\epsilon \ge 0 \mid x,z]\big) + \int_{[v,0)} (e - v)\, dF_\epsilon[e \mid x,z] \ \ge\ 0. \tag{25}
\end{align*}

(25) $\ge 0$, since $v < 0$ and (i) $P[\epsilon < 0 \mid x,z] \le \tau$ and (ii) $\int_{[v,0)} (e - v)\, dF_\epsilon[e \mid x,z] \ge 0$, and one of these inequalities must be strict. Indeed, if $P[\epsilon = 0 \mid x,z] > 0$, the inequality (i) is strict. If, on the other hand, $P[\epsilon = 0 \mid x,z] = 0$, then the inequality (ii) must be strict. Indeed, in this case $\int_{[v,0)} (e - v)\, dF_\epsilon[e \mid x,z] = 0$ occurs only if $F_\epsilon[e \mid x,z]$ is flat (assigns no mass) on $(v,0)$, which, given that there is no mass at 0, contradicts $0 = Q_{\epsilon|x,z}(\tau)$.

Next, for any $v > 0$

\begin{align*}
& E[\rho_\tau(\epsilon - v) \mid x,z] - E[\rho_\tau(\epsilon) \mid x,z] \\
&= (1-\tau)\int_{(-\infty,0]} v \, dF_\epsilon[e \mid x,z] + \int_{(0,v)} [(1-\tau)v - e] \, dF_\epsilon[e \mid x,z] + \tau \int_{[v,\infty)} (-v) \, dF_\epsilon[e \mid x,z] \\
&= (1-\tau)\, v\, P[\epsilon \le 0 \mid x,z] - \tau v P[\epsilon \in (0,v) \mid x,z] - \tau v P[\epsilon \ge v \mid x,z] + \int_{(0,v)} [v - e]\, dF_\epsilon[e \mid x,z] \\
&= v\big((1-\tau) P[\epsilon \le 0 \mid x,z] - \tau P[\epsilon > 0 \mid x,z]\big) + \int_{(0,v)} [v - e]\, dF_\epsilon[e \mid x,z] \ \ge\ 0. \tag{26}
\end{align*}

(26) $\ge 0$ since $v > 0$ and $P[\epsilon > 0 \mid x,z] \le 1 - \tau$ and $\int_{(0,v)} [v - e]\, dF_\epsilon[e \mid x,z] \ge 0$. (26) $= 0$ if (i) $P[\epsilon > 0 \mid x,z] = 1 - \tau$, thus $P[\epsilon \le 0 \mid x,z] = \tau$, and (ii) the second term is zero. (ii) happens iff $t \mapsto F_\epsilon(t \mid x,z)$ is flat on $(0,v)$. When (26) $= 0$, 0 is not the unique predictor under $\rho_\tau$ loss. Since $\tau \mapsto Q_{\epsilon|x,z}(\tau)$ is left-continuous, for any sequence $\tau_m \uparrow \tau$ we have $q_m \equiv Q_{\epsilon|x,z}(\tau_m) \uparrow 0$ and, for any $v > 0$, denoting $\epsilon_m = \epsilon - q_m$,

\begin{align*}
& E[\rho_{\tau_m}(\epsilon_m - v) \mid x,z] - E[\rho_{\tau_m}(\epsilon_m) \mid x,z] \\
&= v\big((1-\tau_m) P[\epsilon \le q_m \mid x,z] - \tau_m P[\epsilon > q_m \mid x,z]\big) + \int_{(0,v)} [v - e]\, dF_{\epsilon_m}[e \mid x,z] \ >\ 0,
\end{align*}

since both of the terms are non-negative by the earlier arguments, and $\int_{(0,v)} [v - e]\, dF_{\epsilon_m}[e \mid x,z] = \int_{(q_m, v+q_m)} [v - e + q_m]\, dF_\epsilon[e \mid x,z] > 0$, since $0 = Q_{\epsilon|x,z}(\tau) \in (q_m, v + q_m)$ for sufficiently large $m$. In other words, the last statement implies that $F_\epsilon[\cdot \mid x,z]$ has to assign positive mass to $(q_m, v + q_m)$ for large $m$. In addition, by arguments like in (25), for any $v < 0$

\[ E[\rho_{\tau_m}(\epsilon_m - v) \mid x,z] - E[\rho_{\tau_m}(\epsilon_m) \mid x,z] > 0. \]

Thus, $Q_\epsilon[\tau_m \mid x,z]$ are unique best predictors for large $m$, and $\lim_{\tau_m \uparrow \tau} Q_\epsilon[\tau_m \mid x,z] = 0$.

Thus, we demonstrated the equivalence of statement (2) and statement (3). 0 is the unique (modified by the limit operation) best predictor under asymmetric absolute loss. Note that the limit operation is only needed for at most countably many $\tau$ in $(0,1)$.

Finally, equivalence (3) $\iff$ (4) is obvious. ■
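
The best-predictor property used in this proof can be seen numerically: the minimizer of the expected check loss $E\rho_\tau(\epsilon - v)$ over $v$ coincides with the $\tau$-quantile. The sketch below is only an illustration; the standard normal draws, the grid, and the seed are assumptions.

```python
import numpy as np

def check_loss(u, tau):
    """Asymmetric absolute loss rho_tau(u) = (tau - 1(u < 0)) * u."""
    return (tau - (u < 0)) * u

rng = np.random.default_rng(3)              # hypothetical seed
tau = 0.3
e = rng.standard_normal(20_000)             # illustrative draws of epsilon
grid = np.linspace(-2.0, 2.0, 401)          # candidate predictors v (step 0.01)

# Empirical risk E_n rho_tau(e - v) evaluated at each candidate v.
risk = np.array([check_loss(e - v, tau).mean() for v in grid])
v_star = grid[risk.argmin()]                # grid minimizer of the check loss
q_tau = np.quantile(e, tau)                 # empirical tau-quantile

print(f"argmin of check loss: {v_star:.3f}, empirical quantile: {q_tau:.3f}")
```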


F     Proof  of  Theorem  3 

The  proof  is  a  special  case  of  Theorem  5  in  section  I  ■ 

G     Assumptions  R.1-R.6 

The following assumptions are maintained for the inverse quantile regression.

R1 $W_i = (Y_i, D_i, X_i, Z_i)$ are iid and $(D_i, X_i, Z_i)$ take values in a compact set.

R2 $(\alpha_\tau, \beta_\tau) \in$ interior $\mathcal{V}$, where $\mathcal{V} = A \times B$ is compact, convex, and $(\alpha_\tau, \beta_\tau)$ is the unique $(\alpha, \beta)$: $E\varphi_\tau(Y_i - D_i'\alpha - X_i'\beta)\Psi_i = 0$, where $\Psi_i = V_i \cdot [X_i' : \Phi_i']'$ and $\varphi_\tau(u) = (1(u < 0) - \tau)$.

R3 $Y$ has bounded conditional density given $X, D, Z$, uniformly over the support of $(X, D, Z)$.

R4 $J(\pi) = \frac{\partial}{\partial(\alpha,\beta,\gamma)} E[\varphi_\tau(Y - D'\alpha - X'\beta - \Phi'\gamma)\Psi]$ has full column rank and is continuous at each $(\alpha,\beta,\gamma)$ in $A \times B \times \mathcal{G}$, where $\mathcal{G}$ is an open ball in $\mathbb{R}^{\dim(\gamma)}$ at zero.

R5 Functions $(z,x) \mapsto \hat\Phi(z,x)$ and $(z,x) \mapsto \hat V(z,x)$ belong to a set $\mathcal{F}$ wp $\to$ 1; $\mathcal{F}$ is a set of boundedly differentiable functions $C^\eta_M$, with smoothness order $\eta > \dim(z,x)/2$.^16 $\hat\Phi(\cdot) \xrightarrow{p} \Phi(\cdot)$, $\hat V(\cdot) \xrightarrow{p} V(\cdot) \in \mathcal{F}$, uniformly over compact sets. $V(\cdot) > 0$.

Remark G.1 All assumptions, except R5, are analogous to the standard assumptions for quantile regression. They may be refined at a cost of more complicated notation and proof.

Remark G.2 Smoothness in R5 needs to hold only for the non-discrete sub-component of $(x,z)$. Ideally, we would like to approximate the optimal instruments and the optimal weight as closely as possible using non-parametric or parametric methods. As discussed in the text, condition R5 allows for a wide variety of such estimators, as shown by Andrews (1994): for example, smooth parametric approximations to $V(X,Z)$ and $\Phi(X,Z)$ or, alternatively, various smooth kernel estimators and smooth series estimators. See Andrews (1994), (1995), Newey (1997), (1990), and Newey and Powell (1990) for a catalogue of estimators that satisfy condition R5.

H     Proof  of  Theorem  4 

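1. In the proof $W$ denotes $(Y,D,X,Z)$. Define for $\theta = (\beta,\gamma)$ and $\theta_\tau = (\beta_\tau, 0)$ and $\varphi_\tau(u) = (1(u<0) - \tau)$

\begin{align*}
f(W,\alpha,\theta) &= \varphi_\tau(Y - D'\alpha - X'\beta - \Phi'\gamma)\Psi, \\
\hat f(W,\alpha,\theta) &= \varphi_\tau(Y - D'\alpha - X'\beta - \hat\Phi'\gamma)\hat\Psi,
\end{align*}

where $\Psi = V \cdot (X', \Phi')'$, $\Phi = \Phi(X,Z)$, $\hat\Psi = \hat V \cdot (X', \hat\Phi')'$, $\hat\Phi \equiv \hat\Phi(X,Z)$;

\begin{align*}
g(W,\alpha,\theta) &= \rho_\tau(Y - D'\alpha - X'\beta - \Phi'\gamma)V, \\
\hat g(W,\alpha,\theta) &= \rho_\tau(Y - D'\alpha - X'\beta - \hat\Phi'\gamma)\hat V,
\end{align*}

where $\rho_\tau(u) = (\tau - 1(u < 0))u$. Let

\[ Q_n(\alpha,\theta) = E_n \hat g(W,\alpha,\theta), \qquad Q(\alpha,\theta) = E g(W,\alpha,\theta), \]

and for $\theta \in B \times \mathcal{G}$

\begin{align*}
\hat\theta(\alpha) &= (\hat\beta(\alpha), \hat\gamma(\alpha)) = \arg\inf_\theta Q_n(\alpha,\theta), \\
\theta(\alpha) &= (\beta(\alpha), \gamma(\alpha)) = \arg\inf_\theta Q(\alpha,\theta), \\
\hat\alpha &= \arg\inf_{\alpha \in A} \|\hat\gamma(\alpha)\|_A, \qquad \alpha^* = \arg\inf_{\alpha \in A} \|\gamma(\alpha)\|_A.
\end{align*}

^16 See page 154 in van der Vaart and Wellner (1996).

By Theorem 1 (i), the true parameters $(\alpha_\tau, \beta_\tau)$ solve the equation

\[ E\varphi_\tau(Y_i - D_i'\alpha_\tau - X_i'\beta_\tau - \Phi_i' 0)\Psi_i = 0. \]

On the other hand, by R3 $\theta(\alpha)$ satisfies the equation:

\[ E\varphi_\tau(Y_i - D_i'\alpha - X_i'\beta(\alpha) - \Phi_i'\gamma(\alpha))\Psi_i = 0. \]

We need to find $\alpha^*$ such that this equation holds and the norm of $\gamma(\alpha)$ is as small as possible. $\alpha^* = \alpha_\tau$ makes the norm of $\gamma(\alpha^*)$ equal zero. Thus $\alpha^* = \alpha_\tau$ is a solution; by R2 it is unique. Additionally, by R2 $\beta(\alpha^*) = \beta_\tau$.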
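
The logic of step 1 (find the $\alpha$ at which the coefficient on the instrument is driven to zero) can be sketched as a grid search over a series of ordinary quantile regressions. The code below is a minimal illustration under simplifying assumptions that are not part of the paper: a scalar $D$, a scalar instrument used directly as $\Phi$, unit weights $V = 1$, a hypothetical location-shift design with true $\alpha_\tau = 1$, and quantile regression solved through its standard linear-programming form with `scipy.optimize.linprog`. The function names are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def quantreg(y, X, tau):
    """Quantile regression via linear programming:
    minimize tau*sum(u_plus) + (1-tau)*sum(u_minus)
    subject to X b + u_plus - u_minus = y, u_plus, u_minus >= 0."""
    n, k = X.shape
    c = np.concatenate([np.zeros(k), np.full(n, tau), np.full(n, 1.0 - tau)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * k + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:k]

def inverse_quantreg(y, d, x, phi, tau, alpha_grid):
    """For each alpha, regress y - d*alpha on (x, phi) at quantile tau and
    keep the alpha whose instrument coefficient gamma(alpha) is closest to 0."""
    X = np.column_stack([x, phi])
    results = []
    for a in alpha_grid:
        coef = quantreg(y - d * a, X, tau)
        results.append((abs(coef[-1]), a, coef[:-1]))
    _, alpha_hat, beta_hat = min(results, key=lambda r: r[0])
    return alpha_hat, beta_hat

# Hypothetical design: Y = 1.0 * D + e, with D correlated with e (endogenous).
rng = np.random.default_rng(5)
n, tau = 500, 0.5
z = rng.standard_normal(n)                      # instrument
e = rng.standard_normal(n)                      # structural error
d = z + 0.8 * e + 0.6 * rng.standard_normal(n)  # endogenous treatment
y = 1.0 * d + e

alpha_hat, beta_hat = inverse_quantreg(
    y, d, np.ones(n), z, tau, np.linspace(0.0, 2.0, 41))
print(f"alpha_hat = {alpha_hat:.2f}, intercept = {beta_hat[0]:.2f}")
```

With a vector-valued $\gamma$ one would replace the absolute value by the weighted norm $\|\gamma(\alpha)\|_A$, and a finer or adaptive grid would be needed for precise estimates.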

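2. For each $\alpha$ and $\theta$, by a LLN (Lemma K.1)

\[ E_n[\hat g(W,\alpha,\theta) - \hat g(W,\alpha_\tau,\bar\theta)] \xrightarrow{p} E[g(W,\alpha,\theta) - g(W,\alpha_\tau,\bar\theta)], \]

for some fixed $\bar\theta$ (the subtraction of the terms is to make the summands bounded functions of $W$). The lhs is a finite convex function in $\theta$ and $\alpha$, at least wp $\to$ 1. Therefore the convergence is uniform over compact sets. Hence by Lemma A.1

\[ \hat\theta(\alpha_\tau) \xrightarrow{p} \theta_\tau \quad \text{and} \quad \hat\theta(\hat\alpha) \xrightarrow{p} \theta_\tau, \quad \text{provided } \hat\alpha \xrightarrow{p} \alpha_\tau. \]

3. $\hat\alpha \xrightarrow{p} \alpha_\tau$ (shown below).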

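4. By the computational properties of the quantile regression estimator $\hat\theta(\alpha_n)$, for any $\alpha_n$ in a small ball at $\alpha_\tau$

\[ O(1/\sqrt{n}) = \sqrt{n}\, E_n \hat f(W,\alpha_n,\hat\theta(\alpha_n)). \tag{28} \]

By Lemma K.1, the following expansion of the r.h.s. is valid for any $\alpha_n \xrightarrow{p} \alpha_\tau$:^17

\begin{align*}
\sqrt{n}\, E_n \hat f(W,\alpha_n,\hat\theta(\alpha_n)) &= G_n \hat f(W,\alpha_n,\hat\theta(\alpha_n)) + \sqrt{n}\, E \hat f(W,\alpha_n,\hat\theta(\alpha_n)) \\
&= G_n f(W,\alpha_\tau,\theta_\tau) + o_p(1) + \sqrt{n}\, E \hat f(W,\alpha_n,\hat\theta(\alpha_n)).
\end{align*}

Expanding the last line further

\begin{align*}
&= G_n f(W,\alpha_\tau,\theta_\tau) + o_p(1) \\
&\quad + (J_\theta + o_p(1))\sqrt{n}(\hat\theta(\alpha_n) - \theta_\tau) \tag{30} \\
&\quad + (J_\alpha + o_p(1))\sqrt{n}(\alpha_n - \alpha_\tau).
\end{align*}

In other words, for any $\alpha_n \xrightarrow{p} \alpha_\tau$

\[ \sqrt{n}(\hat\theta(\alpha_n) - \theta_\tau) = -J_\theta^{-1} G_n f(W,\alpha_\tau,\theta_\tau) - J_\theta^{-1} J_\alpha [1 + o_p(1)] \sqrt{n}(\alpha_n - \alpha_\tau) + o_p(1), \]

i.e.

\[ \sqrt{n}(\hat\gamma(\alpha_n) - 0) = -\bar J_\gamma G_n f(W,\alpha_\tau,\theta_\tau) - \bar J_\gamma J_\alpha [1 + o_p(1)] \sqrt{n}(\alpha_n - \alpha_\tau) + o_p(1). \]

Over a shrinking ball at $\alpha_\tau$, denoted $B_n(\alpha_\tau)$, wp $\to$ 1, for $\|x\|_A = \sqrt{x'Ax}$,

\[ \hat\alpha = \arg\inf_{\alpha_n \in B_n(\alpha_\tau)} \|\hat\gamma(\alpha_n)\|_A. \]

Observe that

\[ \sqrt{n}\,\|\hat\gamma(\alpha_n)\|_A = \big\| O_p(1) - \bar J_\gamma J_\alpha [1 + o_p(1)] \sqrt{n}(\alpha_n - \alpha_\tau) \big\|_A + o_p(1). \]

^17 Note that by convention in empirical process theory $E\hat f(W)$ means $(E f(W))_{f = \hat f}$.

Since $\bar J_\gamma J_\alpha$ and $A$ have full rank, $\sqrt{n}(\hat\alpha - \alpha_\tau) = O_p(1)$. Hence by Lemma A.2

\[ \sqrt{n}(\hat\alpha - \alpha_\tau) \overset{L}{=} \arg\inf_{\mu} \big\| -\bar J_\gamma G_n f(W,\alpha_\tau,\theta_\tau) - \bar J_\gamma J_\alpha \mu \big\|_A, \]

where $\overset{L}{=}$ means that the limit distributions of the lhs and rhs agree. Conclude that:

\begin{align*}
\sqrt{n}(\hat\alpha - \alpha_\tau) &\overset{L}{=} -(J_\alpha' \bar J_\gamma' A \bar J_\gamma J_\alpha)^{-1} J_\alpha' \bar J_\gamma' A \bar J_\gamma\, G_n f(W,\alpha_\tau,\theta_\tau) \quad \text{and} \\
\sqrt{n}(\hat\theta(\hat\alpha) - \theta_\tau) &\overset{L}{=} -J_\theta^{-1}\big[I - J_\alpha (J_\alpha' \bar J_\gamma' A \bar J_\gamma J_\alpha)^{-1} J_\alpha' \bar J_\gamma' A \bar J_\gamma\big] G_n f(W,\alpha_\tau,\theta_\tau)
\end{align*}

with $G_n f(W,\alpha_\tau,\theta_\tau) \xrightarrow{d} N(0,S)$ by CLT.

Finally, the consistency and asymptotic representation for $\hat\beta$ follow analogously to those of $\hat\beta(\hat\alpha)$, except that in the definition of $\hat\theta$ instead of $\hat\gamma$ we use its plim 0. Therefore, analogously to (40)-(42):

\[ \sqrt{n}(\hat\beta - \beta_\tau) \overset{L}{=} -J_\beta^{-1} [I : 0] \big[I - J_\alpha (J_\alpha' \bar J_\gamma' A \bar J_\gamma J_\alpha)^{-1} J_\alpha' \bar J_\gamma' A \bar J_\gamma\big] G_n f(W,\alpha_\tau,\theta_\tau). \ ■ \]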

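Proof of step 3. The argument is just slightly more complicated than usual consistency arguments, cf. Amemiya (1985) or Newey and McFadden (1994). For some $\bar\theta$, $\hat\theta(\alpha)$ maximizes

\[ Q_n(\alpha,\theta) \equiv -E_n[\hat g(W,\alpha,\theta) - \hat g(W,\alpha_\tau,\bar\theta)] \xrightarrow{p} Q_\infty(\alpha,\theta) \equiv -E[g(W,\alpha,\theta) - g(W,\alpha_\tau,\bar\theta)] \tag{31} \]

where the convergence is uniform in $(\theta,\alpha)$ over compact sets, using step 2. For any $\epsilon > 0$, wp $\to$ 1, uniformly in $\alpha \in A$: [i] $Q_n(\alpha,\hat\theta(\alpha)) \ge Q_n(\alpha,\theta(\alpha))$ by definition, [ii] $Q_\infty(\alpha,\hat\theta(\alpha)) > Q_n(\alpha,\hat\theta(\alpha)) - \epsilon/2$ by (31), [iii] $Q_n(\alpha,\theta(\alpha)) > Q_\infty(\alpha,\theta(\alpha)) - \epsilon/2$ by (31). Hence wp $\to$ 1

\[ Q_\infty(\alpha,\hat\theta(\alpha)) > Q_n(\alpha,\hat\theta(\alpha)) - \epsilon/2 \ge Q_n(\alpha,\theta(\alpha)) - \epsilon/2 > Q_\infty(\alpha,\theta(\alpha)) - \epsilon. \]

Let $\{B(\alpha), \alpha \in A\}$ be a collection of balls with diameter $\delta$, each centered at $\theta(\alpha)$. Then $\epsilon \equiv \inf_{\alpha \in A}\big[Q_\infty(\alpha,\theta(\alpha)) - \sup_{\theta \in \Theta \setminus B(\alpha)} Q_\infty(\alpha,\theta)\big] > 0$, by assumption R4 and concavity in $\theta$ for each $\alpha$. It now follows wp $\to$ 1, uniformly in $\alpha$,

\[ Q_\infty(\alpha,\hat\theta(\alpha)) > Q_\infty(\alpha,\theta(\alpha)) - Q_\infty(\alpha,\theta(\alpha)) + \sup_{\theta \in \Theta \setminus B(\alpha)} Q_\infty(\alpha,\theta) = \sup_{\theta \in \Theta \setminus B(\alpha)} Q_\infty(\alpha,\theta). \]

Thus wp $\to$ 1, $\sup_{\alpha \in A} \|\hat\theta(\alpha) - \theta(\alpha)\| \le \delta$, for any $\delta > 0$. This implies that $\sup_{\alpha \in A} \big| \|\hat\gamma(\alpha)\|_A - \|\gamma(\alpha)\|_A \big| \to 0$, which by Lemma A.2 implies $\hat\alpha \xrightarrow{p} \alpha^*$. ■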


I     Identification  Results:  Generalizations 

The following statements and functions are all conditional on the event $X = x$. For notation sake, we suppress this conditioning. Suppose the support of $D$ is a finite set of discrete values in $\mathbb{R}^l$. We can label the points of the support as $\{1, ..., J\}$. Define $\mathcal{L}(x)$ as the convex hull of functions $\varphi$ mapping $d$ from $\{1, ..., J\}$ to $(y : f_Y(y|d) > 0)$ such that $P(Y \le \varphi(D,\tau) \mid Z)$ belongs to $[\tau - \delta, \tau + \delta]$ a.s. for small $\delta > 0$. Define the following function

\[ z \mapsto \Pi_z(\varphi, x) = \big[P[Y \le \varphi(D) \mid z_j],\ 1 \le j \le J\big], \]

where $z = (z_j, 1 \le j \le J)$. Define, assuming relevant smoothness,

\[ J_z(\varphi, x) = \frac{\partial}{\partial \varphi'} \Pi_z(\varphi, x) =
\begin{pmatrix}
f_Y(\varphi(1) \mid D=1, z_1) P[D=1 \mid z_1] & \cdots & f_Y(\varphi(J) \mid D=J, z_1) P[D=J \mid z_1] \\
\vdots & & \vdots \\
f_Y(\varphi(1) \mid D=1, z_J) P[D=1 \mid z_J] & \cdots & f_Y(\varphi(J) \mid D=J, z_J) P[D=J \mid z_J]
\end{pmatrix}. \]

The statement that rank $J_z(\varphi, x)$ is full w. pr. $> 0$ means that with positive probability rank $J_Z(\varphi, x) = J$, where $Z = \{Z_j\}$ are independent replicas of $Z$, given $X = x$.

Theorem 5 (Discrete D) Suppose A1-A5 hold, and that $f_Y(y \mid d,z,x) > 0$ and finite over the range of $d \mapsto q(d,x,\tau)$. Then $d \mapsto q(d,x,\tau)$ is a unique solution of

\[ P(Y \le q(D,x,\tau) \mid x, z) = \tau \quad \text{for P-a.e. } z, \text{ given } X = x, \tag{32} \]

among $\mathcal{L}(x)$ if for any $\varphi \in \mathcal{L}(x)$, $J_z(\varphi, x)$ is finite and has full rank w. pr. $> 0$.

Proof. Condition on the event $X = x$. We know that $q(d,x,\tau)$ solves (32) from Theorem 1, hence it belongs to $\mathcal{L}(x)$. Suppose there exists $q^* \in \mathcal{L}(x)$ that also solves (32) such that $d \mapsto q^*(d)$ and $d \mapsto q(d,x,\tau)$ disagree on $\{1, ..., J\}$. Then

\[ \Pi_Z(q^*, x) = \mathbf{1}\tau \quad \text{and} \quad \Pi_Z(q, x) = \mathbf{1}\tau, \quad P\text{-a.s.}, \]

for a conformable vector of 1's, $\mathbf{1}$. Then for any vector $\lambda \in \mathbb{R}^J \setminus \{0\}$ a Taylor expansion gives

\[ \lambda'\big(\Pi_Z(q^*, x) - \Pi_Z(q, x)\big) = \lambda' J_Z(\bar q, x)(q^* - q) = 0, \quad P\text{-a.s.}, \]

where $\bar q \in \mathcal{L}(x)$, which is impossible by the full rank assumption. ■
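
As an illustration of the rank condition, consider the simplest case of a binary treatment, $J = 2$. This is a worked special case under the notation above, not an additional result:

```latex
% Binary treatment, support labelled {1, 2}; z_1, z_2 are two independent
% replicas of the instrument.
\[
J_z(\varphi, x) =
\begin{pmatrix}
f_Y(\varphi(1) \mid D=1, z_1)\, P[D=1 \mid z_1] & f_Y(\varphi(2) \mid D=2, z_1)\, P[D=2 \mid z_1] \\
f_Y(\varphi(1) \mid D=1, z_2)\, P[D=1 \mid z_2] & f_Y(\varphi(2) \mid D=2, z_2)\, P[D=2 \mid z_2]
\end{pmatrix},
\]
% Full rank is equivalent to a non-zero determinant:
\[
\frac{f_Y(\varphi(1) \mid D=1, z_1)\, P[D=1 \mid z_1]}{f_Y(\varphi(2) \mid D=2, z_1)\, P[D=2 \mid z_1]}
\neq
\frac{f_Y(\varphi(1) \mid D=1, z_2)\, P[D=1 \mid z_2]}{f_Y(\varphi(2) \mid D=2, z_2)\, P[D=2 \mid z_2]}.
\]
```

In words, the condition requires that, with positive probability, two draws of the instrument change the relative weight of the two treatment states at the candidate quantiles. It fails, for example, if the joint distribution of $(Y, D)$ given $Z = z$ does not vary with $z$.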

Finally, we consider continuous $D$.^18 As Newey and Powell (2001) have shown, in the model $E(Y - \mu(D) \mid Z) = 0$, the condition for identification of $\mu$ is the Lehmann-Scheffé completeness condition:

L1 $E[\Delta(D) \mid z] = 0$ P-a.e. $\Rightarrow$ $\Delta(D) = 0$ $\mathcal{P}$-a.e.,

where $\mathcal{P}$ is a collection of $F_D[\cdot \mid z]$ as $z$ varies over the support of $Z$ given $X = x$. Lehmann (1954) provided a sufficient "happy family" condition:

L2 $P[D = d \mid z]$ is a full rank exponential family $h(d) \cdot \exp(\eta(z)'T(d) + A(z))$.

The full rank condition requires $\eta(z)$ to vary over an open rectangle in $\mathbb{R}^{\dim(T(d))}$ and $T(d)$ not to satisfy a linear constraint. L2 allows for a broad variety of non-parametric distributions.

The statements are conditional on the event $X = x$ but we suppress this conditioning. Define $\mathcal{L}(x)$ as a convex hull of functions $m$ that map a $d$ from the set $\mathcal{D}(x)$, the support of $D$, to $(y : f(y|d) > 0)$ such that $P[Y \le m(D) \mid Z] \in [\tau - \delta, \tau + \delta]$ a.s., for small $\delta > 0$ given $X = x$. Solution $q$ is said to be unique if any other solution $m = q$ $\mathcal{P}$-a.e., where $\mathcal{P}$ is defined above.

Theorem 6 Suppose A1-A5 hold, and that $f_Y(y \mid d,z,x) > 0$ and finite over the range of $d \mapsto q(d,x,\tau)$, uniformly in $z$. Then $d \mapsto q(d,x,\tau)$ is a unique solution of (32) among $\mathcal{L}(x)$ if

i. for any $\Delta(d) = m(d) - q(d,x,\tau)$ such that $m \in \mathcal{L}(x)$, and $\epsilon = Y - q(D,x,\tau)$ and independent standard uniform $\zeta$: $E[f_\epsilon(\zeta\Delta(D) \mid D, z)\Delta(D) \mid z] = 0$ P-a.e. $\Rightarrow$ $\Delta(D) = 0$ $\mathcal{P}$-a.e.

ii. A sufficient condition for i. is that $f(t, d \mid z) \equiv t^{-1}\big[F_\epsilon[t \mid d,z] - F_\epsilon[0 \mid d,z]\big] \cdot f_D[d \mid z] \propto h(d,t) \cdot \exp(\eta(z)'T(d,t))$ is an exponential family of full rank.

In the last expression, $f(0, d \mid z) = \lim_{t \to 0} f(t, d \mid z)$. Condition ii. is a plausible non-parametric condition, with the rhs motivated as a Taylor approximation of the log of the lhs.

^18 To be removed and is given here for completeness. The full treatment is to be given in the joint work with Whitney Newey and Guido Imbens.

Proof. By hypothesis there is $m \in \mathcal{L}(x)$ such that $P[Y \le m(D) \mid z] = \tau$ P-a.e. Then the difference $0 = P[Y \le m(D) \mid z] - P[Y \le q(D) \mid z]$ equals by Taylor expansion

\[ E\Big[\Delta(D) \int_0^1 f_\epsilon(\delta\Delta(D) \mid D, z)\, d\delta \,\Big|\, z\Big] = E\big[f_\epsilon(\zeta\Delta(D) \mid D, z)\Delta(D) \mid z\big] = 0, \tag{33} \]

where $\zeta$ is a uniform variable on $[0,1]$, independent of $Z$, $D$, and $\Delta = m - q$. (33) is proportional to $E^*[\Delta(D) \mid z]$ where $E^*$ is the expectation distorted by $f_\epsilon(\zeta\Delta(D) \mid D, z)$. For uniqueness we need that (33) $= 0$ implies $\Delta(D) = 0$ $\mathcal{P}$-a.e. This proves i. To prove that ii. is sufficient, note that by L2 condition ii. implies (also using $f(t(d), d \mid z) = 0 \iff f_D(d \mid z) = 0$)

\[ \int f(t(d), d \mid z)\,\Delta(d)\, \mathrm{d}d = 0 \implies \Delta(d) = 0 \ \ \mathcal{P}\text{-a.e.} \tag{34} \]

for any measurable function $t(d)$, since $f(t(d), d \mid z)$ remains a full $\dim(d)$-rank exponential family. Thus if $t(d) = \Delta(d)$, (34) still holds. Now note that $\int_0^1 f_{Y - q(D)}(\zeta\Delta(D) \mid D, z)\, d\zeta$ equals $\{P[Y - q(D) \le \Delta(D) \mid D, z] - P[Y - q(D) \le 0 \mid D, z]\}/\Delta(D)$, so $\int f(\Delta(d), d \mid z)\Delta(d)\, \mathrm{d}d$ = lhs of (33). ■


J     Additional  Results:  Empirical  Likelihood 

Here we treat the generalized empirical likelihood. Define for $\vartheta = (\alpha, \beta)$, $\varphi_\tau(u) = \tau - 1(u < 0)$

\[ \hat f(W, \vartheta) \equiv \varphi_\tau(Y - D'\alpha - X'\beta)\hat\Psi, \quad \text{where} \]

$\Psi = \Psi(X, Z)$ is a smooth function of an instrument,

$\hat\Psi = \hat\Psi(X, Z)$ is a smooth consistent estimate of such function, satisfying L5.

Define also

\[ Q_n(\vartheta, \gamma) = \frac{1}{n} \sum_{i=1}^n s[\hat f(W_i, \vartheta)'\gamma], \]

\[ \hat\gamma(\vartheta) = \arg\inf_\gamma Q_n(\vartheta, \gamma), \qquad \hat\vartheta = \arg\sup_\vartheta \big[\inf_\gamma Q_n(\vartheta, \gamma)\big]. \tag{35} \]

Function $s(\cdot)$ equals a strictly convex, finite, and four times differentiable function $s_0$ on an open interval of $\mathbb{R}$ containing 0, and equals $+\infty$ outside it. Normalize $[\partial^j s(v)/\partial v^j](0) = 1$ for $j = 1, 2$. Functions $s_0(v) = -\ln(1 - v)$, $\exp(v)$, $(1 + v)^2/2$ lead to the well-known empirical likelihood, exponential tilting, and continuous updating GMM estimators. See Imbens (1997), Newey and Smith (2001), and Kitamura and Stutzer (1997).
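
The normalization of $s$ can be verified for the three named choices with a quick finite-difference check. The sketch below is illustrative only; the step size and the helper name `derivs_at_zero` are assumptions.

```python
import numpy as np

# The three carrier functions s_0 named in the text.
S_FUNCS = {
    "empirical likelihood": lambda v: -np.log(1.0 - v),
    "exponential tilting": lambda v: np.exp(v),
    "continuous updating": lambda v: 0.5 * (1.0 + v) ** 2,
}

def derivs_at_zero(s, h=1e-4):
    """Central finite-difference approximations to s'(0) and s''(0)."""
    d1 = (s(h) - s(-h)) / (2 * h)
    d2 = (s(h) - 2 * s(0.0) + s(-h)) / h**2
    return d1, d2

checks = {name: derivs_at_zero(s) for name, s in S_FUNCS.items()}
for name, (d1, d2) in checks.items():
    print(f"{name}: s'(0) ~ {d1:.4f}, s''(0) ~ {d2:.4f}")
```

All three functions have first and second derivatives equal to one at zero, which is what makes the second-order expansions in the proofs below take the same form for every member of the family.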

Just as GMM, GEL is computationally infeasible in our settings with two or more covariates. It may be useful in low-dimensional settings or as a refinement of the IQR estimator. The latter can be used to bring the estimates to the right neighborhood, and the GEL estimator can then be recomputed over such a neighborhood. For this purpose, GEL estimators are known to have good finite-sample properties. The pivotal objective function of GEL can be used for construction of confidence intervals.
Assumptions L.1-L.6 The following assumptions are maintained.

L1 $W_i = (Y_i, D_i, X_i, Z_i)$ is i.i.d. and $(D_i, X_i, Z_i)$ take values in a compact set.

L2 $\vartheta_0 = (\alpha_\tau, \beta_\tau) \in$ interior $A \times B$, a compact convex set.

L3 $\vartheta_0$ is the unique $\vartheta$ s.t. $E\varphi_\tau(Y - D'\alpha - X'\beta)\Psi = 0$, in $\mathcal{V} = A \times B$.

L4 $S(\vartheta) = E\varphi_\tau(Y - D'\alpha - X'\beta)^2 \Psi\Psi'$ is positive definite for each $\vartheta \in \mathcal{V}$. ($S \equiv S(\vartheta_0)$)

L5 $(z,x) \mapsto \hat\Psi(z,x) \in \mathcal{F}$ wp $\to$ 1; $\mathcal{F}$ is a set of boundedly differentiable functions $C^\eta_M$, with smoothness order $\eta > \dim(z,x)/2$.^19 $\hat\Psi(\cdot) \xrightarrow{p} \Psi(\cdot) \in \mathcal{F}$, uniformly over compacts.

L6 $J(\vartheta) = \frac{\partial}{\partial\vartheta'} E[\varphi_\tau(Y - D'\alpha - X'\beta)\Psi] = E[f_{Y - D'\alpha - X'\beta}(0 \mid X, D, Z)\Psi[D' : X']]$ is defined, finite, has full column rank, and is continuous at each $\alpha, \beta$ in $A \times B$.

Remark  J.l   All  assumptions,  but  L5,  are  fairly  standard.   L5  allows  for  a  wide  variety  of 
nonparametric  and  parametric  estimators.  See  Remark  G.2. 

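Theorem 7 (GEL) In the linear model (8) and under assumptions L1-L6 listed above

\[ \sqrt{n} \begin{pmatrix} \hat\vartheta - \vartheta_0 \\ \hat\gamma \end{pmatrix} \xrightarrow{d} N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} (J'S^{-1}J)^{-1} & 0 \\ 0 & S^{-1}\big[S - J(J'S^{-1}J)^{-1}J'\big]S^{-1} \end{pmatrix} \right), \]

where $S = \tau(1 - \tau)E\Psi\Psi'$ and $J = E[f_\epsilon(0 \mid X, D, Z)\Psi[D' : X']]$, $\epsilon = Y - D'\alpha(\tau) - X'\beta(\tau)$.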

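Corollary 3 Further, if we set $\Psi^* = V^* \cdot [X', \Phi^{*\prime}]'$, where $\Phi^* = E[D \cdot v \mid Z, X]/V^*$, $v = f_\epsilon[0 \mid D, Z, X]$, and $V^* = f_\epsilon(0 \mid X, Z)$, the asymptotic variance of $\hat\alpha$ and $\hat\beta$ simplifies to the efficiency bound

\[ \tau(1 - \tau)\, E[\Psi^*\Psi^{*\prime}]^{-1}. \]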

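Proof. The proof extends the arguments of Kitamura and Stutzer (1997) and Christoffersen, Hahn, and Inoue (1999).

1. Define

\[ f(W, \vartheta) = \varphi_\tau(Y - D'\alpha - X'\beta)\Psi, \qquad \hat f(W, \vartheta) = \varphi_\tau(Y - D'\alpha - X'\beta)\hat\Psi, \]

where by $\Psi$ and $\hat\Psi$ we denote vectors $\Psi(X, Z)$ and $\hat\Psi(X, Z)$. Define also

\begin{align*}
Q_n(\vartheta, \gamma) &= E_n s[\hat f(W, \vartheta)'\gamma], & Q(\vartheta, \gamma) &= E\, s[f(W, \vartheta)'\gamma], \\
\hat\gamma(\vartheta) &= \arg\inf_\gamma Q_n(\vartheta, \gamma), & \hat\vartheta &= \arg\sup_{\vartheta \in \mathcal{V}} \inf_\gamma Q_n(\vartheta, \gamma), \\
\gamma(\vartheta) &= \arg\inf_\gamma Q(\vartheta, \gamma), & \vartheta^* &= \arg\sup_{\vartheta \in \mathcal{V}} \inf_\gamma Q(\vartheta, \gamma).
\end{align*}

By arguments of Kitamura and Stutzer (1997) or Newey and Smith, $\vartheta^* = \vartheta_0$ and $\gamma(\vartheta^*) = 0$.

2. By Lemma K.1, in $\mathbb{R}$, for any $\vartheta_n \xrightarrow{p} \vartheta_0$, $E_n s[\hat f(W, \vartheta_n)'\gamma] \xrightarrow{p} E\, s[f(W, \vartheta_0)'\gamma]$ for each $\gamma$ in a dense countable subset of $\mathbb{R}^{\dim(\gamma)}$. Hence by convexity lemma A.1, since $E\, s[f(W, \vartheta_0)'\gamma]$ is finite over an open set by L1,

\[ \hat\gamma(\vartheta_0) \xrightarrow{p} 0, \qquad \hat\gamma(\hat\vartheta) \xrightarrow{p} 0, \quad \text{provided } \hat\vartheta \xrightarrow{p} \vartheta_0. \]

3. By Lemma K.1 and the consistency proof of Kitamura and Stutzer (1997) or e.g. Newey and Smith (2001) and references therein that do not require smoothness of $f$: $\hat\vartheta \xrightarrow{p} \vartheta_0$.

4. Step 4, proved below, shows $\sqrt{n}\, E_n \hat f(W, \hat\vartheta) = O_p(1)$.

5. In view of steps 2-3, by Lemma K.1 and properties of $s$, the following expansion of the first order conditions is valid, wp $\to$ 1, for $(\gamma_n, \vartheta_n) = (\hat\gamma, \hat\vartheta)$ or $(\gamma_n, \vartheta_n) = (\hat\gamma(\vartheta_0), \vartheta_0)$:

\begin{align*}
0 &= \sqrt{n}\, E_n \hat f(W, \vartheta_n)\, s'[\hat f(W, \vartheta_n)'\gamma_n] \\
&= [\sqrt{n}\, E_n \hat f(W, \vartheta_n)] + E_n \hat f(W, \vartheta_n)\hat f(W, \vartheta_n)' \sqrt{n}\gamma_n + O_p(\sqrt{n}\|\gamma_n\|^2) \tag{36} \\
&= [\sqrt{n}\, E_n \hat f(W, \vartheta_0) + (J + o_p(1))\sqrt{n}(\vartheta_n - \vartheta_0)] + (S + o_p(1))\sqrt{n}\gamma_n + O_p(\sqrt{n}\|\gamma_n\|^2).
\end{align*}

^19 See page 154 in van der Vaart and Wellner (1996).

By step 4, (36), and L4, $\sqrt{n}\gamma_n = O_p(1)$, i.e.

\[ \sqrt{n}\hat\gamma = O_p(1) \quad \text{and} \quad \sqrt{n}\hat\gamma(\vartheta_0) = O_p(1), \tag{37} \]

and by (36), (37), and L6,

\[ \sqrt{n}(\hat\vartheta - \vartheta_0) = O_p(1). \tag{38} \]

6. Step 6, proved below, shows

\[ \sqrt{n}\, J'\hat\gamma = o_p(1). \tag{39} \]

7. From (36), $\sqrt{n}\hat\gamma = -S^{-1}\sqrt{n}\, E_n \hat f(W, \vartheta_0) - S^{-1}J\sqrt{n}(\hat\vartheta - \vartheta_0) + o_p(1)$, which when put into (39) gives $J'S^{-1}\sqrt{n}\, E_n \hat f(W, \vartheta_0) + J'S^{-1}J\sqrt{n}(\hat\vartheta - \vartheta_0) + o_p(1) = 0$, which yields

\begin{align*}
\sqrt{n}(\hat\vartheta - \vartheta_0) &= -(J'S^{-1}J)^{-1}J'S^{-1}\sqrt{n}\, E_n \hat f(W, \vartheta_0) + o_p(1) \xrightarrow{d} N(0, (J'S^{-1}J)^{-1}), \\
\sqrt{n}\hat\gamma &= -[S^{-1} - S^{-1}J(J'S^{-1}J)^{-1}J'S^{-1}]\sqrt{n}\, E_n \hat f(W, \vartheta_0) + o_p(1) \\
&\xrightarrow{d} N(0, S^{-1}[S - J(J'S^{-1}J)^{-1}J']S^{-1}),
\end{align*}

and also jointly, with asymptotic covariance between $\sqrt{n}(\hat\vartheta - \vartheta_0)$ and $\sqrt{n}\hat\gamma$ equal to 0. ■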
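
The algebra behind step 7 can be confirmed numerically for arbitrary conformable matrices. The sketch below uses randomly generated $S$ and $J$ with hypothetical dimensions; it is not tied to any particular data set and only checks the matrix identities implied by the two influence representations.

```python
import numpy as np

rng = np.random.default_rng(4)             # hypothetical seed
k_psi, k_theta = 4, 2                      # assumed dims: dim(Psi) > dim(vartheta)
A = rng.standard_normal((k_psi, k_psi))
S = A @ A.T + k_psi * np.eye(k_psi)        # arbitrary positive definite S
J = rng.standard_normal((k_psi, k_theta))  # arbitrary full-column-rank J

S_inv = np.linalg.inv(S)
H = np.linalg.inv(J.T @ S_inv @ J)         # avar of sqrt(n)(theta_hat - theta_0)
M = S_inv - S_inv @ J @ H @ J.T @ S_inv    # sqrt(n) gamma_hat = -M G + o_p(1)

avar_theta = (H @ J.T @ S_inv) @ S @ (H @ J.T @ S_inv).T
avar_gamma = M @ S @ M.T
cross_cov = (H @ J.T @ S_inv) @ S @ M.T    # covariance between the two blocks

print("avar_theta equals (J'S^-1 J)^-1:", np.allclose(avar_theta, H))
print("avar_gamma equals M:", np.allclose(avar_gamma, M))
print("blocks uncorrelated:", np.allclose(cross_cov, 0.0))
print("J' avar_gamma = 0:", np.allclose(J.T @ avar_gamma, 0.0))
```

The last identity shows that the limit variance of $\sqrt{n}\hat\gamma$ is degenerate in the directions spanned by $J$, in line with (39).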
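Proof of step 4. By definition, for any $g_n = O_p(1/\sqrt{n})$,

\begin{align*}
-n\big(E_n s[\hat f(W, \hat\vartheta)'g_n] - s(0)\big) &\le -n\big(E_n s[\hat f(W, \hat\vartheta)'\hat\gamma] - s(0)\big) \\
&\le -n\big(E_n s[\hat f(W, \vartheta_0)'\hat\gamma(\vartheta_0)] - s(0)\big). \tag{40}
\end{align*}

By Lemma K.1, wp $\to$ 1 the following expansions are valid (by steps like in (36))

\begin{align*}
\text{rhs of (40)} &= -\sqrt{n}\, E_n[\hat f(W, \vartheta_0)]'\hat\gamma(\vartheta_0)\sqrt{n} - \tfrac{1}{2}\sqrt{n}\,\hat\gamma(\vartheta_0)' S\, \hat\gamma(\vartheta_0)\sqrt{n} \\
&\quad + O_p(n\|\hat\gamma(\vartheta_0)\|^3) = O_p(1), \tag{41}
\end{align*}

(since $\sqrt{n}\hat\gamma(\vartheta_0) = O_p(1)$ and $\sqrt{n}\, E_n[\hat f(W, \vartheta_0)] = \sqrt{n}\, E_n[f(W, \vartheta_0)] + o_p(1)$ by Lemma K.1), and

\[ \text{lhs of (40)} = -\sqrt{n}\, E_n[\hat f(W, \hat\vartheta)]'g_n\sqrt{n} - \tfrac{1}{2}\sqrt{n}\, g_n' S g_n \sqrt{n} + O_p(n\|g_n\|^3). \tag{42} \]

By (40)-(42), because $g_n = O_p(1/\sqrt{n})$ is arbitrary, we have $\sqrt{n}\, E_n \hat f(W, \hat\vartheta) = O_p(1)$. ■

Proof of step 6. For any $\vartheta_n$ with $\vartheta_n - \hat\vartheta = O_p(1/\sqrt{n})$, by definition

\[ -n\big(E_n s[\hat f(W, \hat\vartheta)'\hat\gamma] - s[0]\big) \le -n\big(E_n s[\hat f(W, \vartheta_n)'\hat\gamma] - s[0]\big). \tag{43} \]

By Lemma K.1 and step 5, the following expansions (by steps like in (36)) are valid

\begin{align*}
\text{lhs of (43)} &= -\sqrt{n}\, E_n \hat f(W, \hat\vartheta)'\hat\gamma\sqrt{n} - \tfrac{1}{2}\sqrt{n}\,\hat\gamma' S \hat\gamma \sqrt{n} + o_p(1), \\
\text{rhs of (43)} &= -\sqrt{n}\, E_n \hat f(W, \vartheta_n)'\hat\gamma\sqrt{n} - \tfrac{1}{2}\sqrt{n}\,\hat\gamma' S \hat\gamma \sqrt{n} + o_p(1), \tag{44}
\end{align*}

and by Lemma K.1

\begin{align*}
\sqrt{n}\, E_n \hat f(W, \vartheta_n) &= \sqrt{n}\, E_n f(W, \vartheta_0) + J(\vartheta_n - \vartheta_0)\sqrt{n} + o_p(1), \\
\sqrt{n}\, E_n \hat f(W, \hat\vartheta) &= \sqrt{n}\, E_n f(W, \vartheta_0) + J(\hat\vartheta - \vartheta_0)\sqrt{n} + o_p(1). \tag{45}
\end{align*}

Putting (43)-(45) together, we have

\[ \sqrt{n}(\vartheta_n - \hat\vartheta)'J'\hat\gamma\sqrt{n} \le o_p(1). \tag{46} \]

Because (46) holds for any such $\vartheta_n$, (46) implies $J'\hat\gamma\sqrt{n} = o_p(1)$. ■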
K     A  Lemma

This  lemma  uses  empirical  process  arguments  to  obtain  some  of  stochastic  relationships. 

Lemma  K.l  (Expansions)    Under  assumptions  L1-L6,  as  $n  ►  i?o,  for  any  real-valued 

function  m  that  is  Liphitz  over  the  range  of  f(W) 

i.  G„/(VM„)=Gn/(W,iM  +  op(l), 
ii.  'E„m[f(W,tfn)]  -^-»  Em[f(W,  #o)]  (  in  particular,  for  m(x)  =  xx'  etc.), 

Hi.   E„s[f(W,i!f„y-y]  >  Es[f(W,  i?o)'7]  fOT  ^ach  -y  in  a  countable  dense  subset  o/R  ""'"''. 

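Under assumptions R1-R6,

iv.  For each $(\alpha,\theta)$, $\mathbb{E}_n[g(W,\alpha,\theta) - g(W,\alpha_\tau,\theta_\tau)] \xrightarrow{p} E[g(W,\alpha,\theta) - g(W,\alpha_\tau,\theta_\tau)]$.
v.  $\mathbb{G}_n f(W,\alpha_n,\hat\theta(\alpha_n)) = \mathbb{G}_n f(W,\alpha_\tau,\theta_\tau) + o_p(1)$, for any $\alpha_n \xrightarrow{p} \alpha_\tau$.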

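Proof.  Denote $\pi = (\alpha, \beta, \gamma)$ and $\Pi = \mathcal{A} \times \mathcal{B} \times \mathcal{G}$, where $\mathcal{G}$ is a ball at 0. The class of functions

$$\mathcal{H} = \{(\Phi, \Psi, \pi) \mapsto \varphi_\tau(Y - D'\alpha - X'\beta - \Phi(X,Z)'\gamma)\,\Psi(X,Z),\ \pi \in \Pi,\ \Phi \in \mathcal{F},\ \Psi \in \mathcal{F}\}$$

is Donsker, as we now show. The bracketing entropy of $\mathcal{F}$ by Cor 2.7.4 in van der Vaart and Wellner (1996) is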
logJVN(e,^,£2(P))  =  o(7  )=0(7 

for some $\delta' < 0$. Thus $\mathcal{F}$ is Donsker. By Cor 2.7.4 in van der Vaart and Wellner (1996) the bracketing entropy of

$$\mathcal{X} = \{(\Phi,\pi) \mapsto (D'\alpha - X'\beta - \Phi(X,Z)'\gamma),\ \pi \in \Pi,\ \Phi \in \mathcal{F}\}$$

is $O(\log N_{[\,]}(\epsilon, \mathcal{F}, L_2(P)))$, because it has the same smoothness properties. Exploiting the monotonicity and boundedness of the indicator function and assumptions R4 or L6, the bracketing entropy of

$$\mathcal{V} = \{(\Phi,\pi) \mapsto \varphi_\tau(Y - D'\alpha - X'\beta - \Phi(X,Z)'\gamma),\ \pi \in \Pi,\ \Phi \in \mathcal{F}\}$$

is $O(\log N_{[\,]}(\epsilon, \mathcal{F}, L_2(P)))$ as well. Therefore $\mathcal{V}$ is Donsker. Class $\mathcal{H}$ is formed as a product of these two uniformly bounded (by L1 or R1 and L5 or R5) classes:

$$\mathcal{H} = \mathcal{F}\,\mathcal{V},$$

so the product is Lipschitz over $(\mathcal{F} \times \mathcal{V})$, and by Theorem 2.10.6 in van der Vaart and Wellner (1996) $\mathcal{H}$ is Donsker.

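Now we show i. using the established Donsker property. Define the process

$$h = (\Psi, \vartheta) \mapsto \mathbb{G}_n \varphi_\tau(Y - D'\alpha - X'\beta)\,\Psi(X,Z).$$

Since $\hat\Psi \xrightarrow{p} \Psi_0$ uniformly over compacts and $\vartheta_n \xrightarrow{p} \vartheta_0$, we have $\rho(\hat h, h_0) \to 0$, where $\rho$ is defined by the $L_2(P)$ seminorm $\rho(h) = E\|\varphi_\tau(Y - D'\alpha - X'\beta)\,\Psi(X,Z)\|$, so that

$$\mathbb{G}_n \varphi_\tau(Y - D'\alpha_n - X'\beta_n)\,\hat\Psi(X,Z) - \mathbb{G}_n \varphi_\tau(Y - D'\alpha_0 - X'\beta_0)\,\Psi_0(X,Z) = o_p(1).$$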
By the above analysis $\mathbb{G}_n m[f(W,\vartheta_n)]$ is Donsker (asymptotically Gaussian) as well, using Theorem 2.10.6 in van der Vaart and Wellner (1996) ($m$ is Lipschitz over bounded subsets, to which $f(W,\vartheta_n)$ belongs wp $\to 1$ by assumption). From this ii. is immediate.
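
Part i. of the lemma is a stochastic equicontinuity statement, and a small simulation can make it concrete. The sketch below is illustrative only and is not the paper's setting: it assumes a scalar outcome $Y \sim U(0,1)$ with no covariates or instruments, the quantile score $f(W,\vartheta) = \tau - 1(Y \le \vartheta)$, and a deterministic local sequence $\vartheta_n = \vartheta_0 + c/\sqrt{n}$. The function names `equicontinuity_gap` and `rms_gap` are hypothetical.

```python
import math
import random


def equicontinuity_gap(n, tau=0.5, c=1.0, rng=None):
    """Return G_n f(W, theta_n) - G_n f(W, theta_0) for one simulated sample.

    Assumed toy design (not from the paper): Y ~ U(0, 1) and
    f(W, theta) = tau - 1(Y <= theta), so theta_0 = tau.
    Requires theta_n <= 1, which holds for n >= 4 with the defaults.
    """
    rng = rng or random.Random(0)
    theta_0 = tau                          # tau-quantile of U(0, 1)
    theta_n = theta_0 + c / math.sqrt(n)   # local sequence converging to theta_0

    ys = [rng.random() for _ in range(n)]

    def g_n(theta):
        # Empirical process: sqrt(n) * (E_n f - E f) evaluated at theta.
        emp = sum(tau - (y <= theta) for y in ys) / n
        pop = tau - theta                  # E[tau - 1(Y <= theta)] under U(0, 1)
        return math.sqrt(n) * (emp - pop)

    return g_n(theta_n) - g_n(theta_0)


def rms_gap(n, reps=100, seed=0):
    """Root-mean-square of the gap over `reps` Monte Carlo replications."""
    rng = random.Random(seed)
    total = sum(equicontinuity_gap(n, rng=rng) ** 2 for _ in range(reps))
    return math.sqrt(total / reps)
```

Calling `rms_gap(100)` and `rms_gap(10_000)` lets one compare the size of the gap at two sample sizes. In this toy design the variance of the gap is exactly $p_n(1 - p_n)$ with $p_n = c/\sqrt{n}$, so it vanishes as $n$ grows, which is the behaviour part i. describes.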

The proofs of v. and iv. follow exactly as those of i. and ii., respectively, using that $\mathcal{H}$ is Donsker.

To show iii., note that $\gamma$ is either such that $E\,s[f(W,\vartheta_0)'\gamma] = +\infty$ or $E\,s[f(W,\vartheta_0)'\gamma] < \infty$. By convexity and lower-semicontinuity of $s$, the latter set, say $F$, is convex, open, and its boundary is nowhere dense in $\mathbb{R}^{\dim(\gamma)}$. Thus for $\gamma \in F$, $E\,s[f(W,\vartheta)'\gamma]\big|_{\vartheta = \vartheta_n} < \infty$, wp $\to 1$. Conditional on this event, step ii. gives $\mathbb{E}_n s[f(W,\vartheta_n)'\gamma] \xrightarrow{p} E\,s[f(W,\vartheta_0)'\gamma] < \infty$. Similarly take $\gamma$ in $\bar{F}^c$, where $\bar{F}$ denotes the closure of $F$. The analogous argument delivers $\mathbb{E}_n s[f(W,\vartheta_n)'\gamma] \xrightarrow{p} E\,s[f(W,\vartheta_0)'\gamma] = \infty$. So iii. follows by taking all the rationals not in the boundary of $F$.  ■